Tidymodels is a collection of packages that handles every step needed to build a workflow for selecting and evaluating statistical learning models.
Its development is funded by RStudio and led by Max Kuhn, the main developer of an older, similar package: caret.
Unlike caret, tidymodels is fully tidy-friendly. It is also more complete and offers far more functionality.
More information is available at tidymodels.org
Tidymodels is made up of orthogonal packages.
The term is borrowed from mathematics: given two orthogonal vectors, we can move along one of them without changing our projection onto the other.
In programming and software architecture, orthogonal components are said to be decoupled: a change in one component does not affect the others. This property calls for smaller, more cohesive components with well-defined responsibilities, and allows changes with fewer side effects. A good book for anyone who wants to learn how to apply concepts like this in programming is The Pragmatic Programmer, by David Thomas and Andrew Hunt.
The packages that make up tidymodels work like this: when you configure the workflow that implements the model training, selection, and evaluation process, several orthogonal steps are set up using several packages.
workflows helps assemble all the stages of the process into a single workflow structure;
recipes creates the preprocessing steps, simplifying feature engineering and its application to out-of-sample data;
rsample helps split data into training, test, and validation sets and provides the sampling infrastructure for cross-validation;
parsnip provides generic interfaces to many types of statistical learning models;
tune helps set up and run the search for the best set of hyperparameters for a model;
dials helps define candidate values for the hyperparameters;
yardstick provides the functionality needed to measure model performance.
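To make that division of labor concrete, here is a minimal sketch of how the packages compose. The dataset, formula, model, and resampling choices below are illustrative only; they are not part of this analysis:

```r
library(tidymodels)

# A small built-in example dataset (from the modeldata package).
data(two_class_dat, package = "modeldata")

split <- initial_split(two_class_dat)           # rsample: train/test split
folds <- vfold_cv(training(split), v = 5)       # rsample: cross-validation folds

rec <- recipe(Class ~ ., data = training(split)) %>%  # recipes: preprocessing
  step_normalize(all_numeric_predictors())

mod <- logistic_reg() %>%                       # parsnip: generic model interface
  set_engine("glm")

wf <- workflow() %>%                            # workflows: glue the pieces together
  add_recipe(rec) %>%
  add_model(mod)

res <- fit_resamples(wf, resamples = folds,     # tune: run over the folds
                     metrics = metric_set(roc_auc))  # yardstick: performance metric
collect_metrics(res)
```

Because each piece is orthogonal, swapping the model (say, `logistic_reg()` for a `rand_forest()`) leaves the recipe, the resampling scheme, and the metrics untouched.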
The data come from a study by Columbia Business School researchers Ray Fisman and Sheena Iyengar.
They ran several rounds of four-minute dates between heterosexual men and women.
Several attributes were collected, including a final verdict on whether each date partner liked the other.
The data were obtained from the Kaggle website
They are not exactly tidy…
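The chunk that loaded the data is not shown; a minimal sketch, assuming the Kaggle CSV was saved locally (the file name below is an assumption, not the original path):

```r
library(tidyverse)

# Read the Kaggle export; the file name is hypothetical.
dados_speed_date <- read_csv("Speed Dating Data.csv")

# Column-wise preview: glimpse() prints one line per column.
glimpse(dados_speed_date)
```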
## Rows: 8,378
## Columns: 195
## $ iid <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ id <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ gender <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ idg <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
## $ condtn <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ wave <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ round <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10...
## $ position <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
## $ positin1 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ order <dbl> 4, 3, 10, 5, 7, 6, 1, 2, 8, 9, 10, 9, 6, 1, 3, 2, 7, 8, 4,...
## $ partner <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, ...
## $ pid <dbl> 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 11, 12, 13, 14, 15...
## $ match <dbl> 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0...
## $ int_corr <dbl> 0.14, 0.54, 0.16, 0.61, 0.21, 0.25, 0.34, 0.50, 0.28, -0.3...
## $ samerace <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1...
## $ age_o <dbl> 27, 22, 22, 23, 24, 25, 30, 27, 28, 24, 27, 22, 22, 23, 24...
## $ race_o <dbl> 2, 2, 4, 2, 3, 2, 2, 2, 2, 2, 2, 2, 4, 2, 3, 2, 2, 2, 2, 2...
## $ pf_o_att <dbl> 35.00, 60.00, 19.00, 30.00, 30.00, 50.00, 35.00, 33.33, 50...
## $ pf_o_sin <dbl> 20.00, 0.00, 18.00, 5.00, 10.00, 0.00, 15.00, 11.11, 0.00,...
## $ pf_o_int <dbl> 20.00, 0.00, 19.00, 15.00, 20.00, 30.00, 25.00, 11.11, 25....
## $ pf_o_fun <dbl> 20.00, 40.00, 18.00, 40.00, 10.00, 10.00, 10.00, 11.11, 10...
## $ pf_o_amb <dbl> 0.00, 0.00, 14.00, 5.00, 10.00, 0.00, 5.00, 11.11, 0.00, 0...
## $ pf_o_sha <dbl> 5.00, 0.00, 12.00, 5.00, 20.00, 10.00, 10.00, 22.22, 15.00...
## $ dec_o <dbl> 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0...
## $ attr_o <dbl> 6, 7, 10, 7, 8, 7, 3, 6, 7, 6, 8, 7, 10, 9, 10, 7, 5, 7, 8...
## $ sinc_o <dbl> 8, 8, 10, 8, 7, 7, 6, 7, 7, 6, 7, 6, 10, 9, 10, 8, 3, 7, 6...
## $ intel_o <dbl> 8, 10, 10, 9, 9, 8, 7, 5, 8, 6, 6, 10, 10, 9, 10, 7, 4, 7,...
## $ fun_o <dbl> 8, 7, 10, 8, 6, 8, 5, 6, 8, 6, 9, 6, 10, 9, 10, 5, 3, 7, 9...
## $ amb_o <dbl> 8, 7, 10, 9, 9, 7, 8, 8, 8, 6, 7, 6, 10, 9, 7, 7, 5, 7, 8,...
## $ shar_o <dbl> 6, 5, 10, 8, 7, 7, 7, 6, 9, 6, 4, 5, 10, 9, 8, 7, 3, 5, 7,...
## $ like_o <dbl> 7.0, 8.0, 10.0, 7.0, 8.0, 7.0, 2.0, 7.0, 6.5, 6.0, 7.0, 8....
## $ prob_o <dbl> 4, 4, 10, 7, 6, 6, 1, 5, 8, 6, 2, 4, 10, 7, 1, 5, 3, 6, 8,...
## $ met_o <dbl> 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2...
## $ age <dbl> 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 24, 24, 24, 24, 24...
## $ field <chr> "Law", "Law", "Law", "Law", "Law", "Law", "Law", "Law", "L...
## $ field_cd <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ undergra <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ mn_sat <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ tuition <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ race <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ imprace <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ imprelig <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
## $ from <chr> "Chicago", "Chicago", "Chicago", "Chicago", "Chicago", "Ch...
## $ zipcode <dbl> 60521, 60521, 60521, 60521, 60521, 60521, 60521, 60521, 60...
## $ income <dbl> 69487, 69487, 69487, 69487, 69487, 69487, 69487, 69487, 69...
## $ goal <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ date <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
## $ go_out <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ career <chr> "lawyer", "lawyer", "lawyer", "lawyer", "lawyer", "lawyer"...
## $ career_c <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sports <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
## $ tvsports <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ exercise <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7...
## $ dining <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, ...
## $ museums <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8...
## $ art <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6...
## $ hiking <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
## $ gaming <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
## $ clubbing <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8...
## $ reading <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 10, 10, 10, 10, 10, 10, 10, ...
## $ tv <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ theater <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9...
## $ movies <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 8, 8, 8, 8, 8, 8, ...
## $ concerts <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 7, 7, 7, 7, 7, 7, ...
## $ music <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8...
## $ shopping <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
## $ yoga <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ exphappy <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4...
## $ expnum <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
## $ attr1_1 <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 45, 45, 45, 45, 45...
## $ sinc1_1 <dbl> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 5, 5, 5, 5, 5, 5, ...
## $ intel1_1 <dbl> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 25, 25, 25, 25, 25...
## $ fun1_1 <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 20, 20, 20, 20, 20...
## $ amb1_1 <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 0, 0, 0, 0, 0, 0, ...
## $ shar1_1 <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 5, 5, 5, 5, 5, 5, ...
## $ attr4_1 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc4_1 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel4_1 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun4_1 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb4_1 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ shar4_1 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ attr2_1 <dbl> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 65, 65, 65, 65, 65...
## $ sinc2_1 <dbl> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 0, 0, 0, 0, 0, 0, ...
## $ intel2_1 <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 10, 10, 10, 10, 10...
## $ fun2_1 <dbl> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 25, 25, 25, 25, 25...
## $ amb2_1 <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ shar2_1 <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ attr3_1 <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7...
## $ sinc3_1 <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
## $ fun3_1 <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 10, 10, 10, 10, 10, 10, 10, ...
## $ intel3_1 <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8...
## $ amb3_1 <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
## $ attr5_1 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc5_1 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel5_1 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun5_1 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb5_1 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ dec <dbl> 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1...
## $ attr <dbl> 6, 7, 5, 7, 5, 4, 7, 4, 7, 5, 5, 8, 5, 7, 6, 8, 7, 5, 7, 6...
## $ sinc <dbl> 9, 8, 8, 6, 6, 9, 6, 9, 6, 6, 7, 5, 8, 9, 8, 7, 5, 8, 6, 7...
## $ intel <dbl> 7, 7, 9, 8, 7, 7, 7, 7, 8, 6, 8, 6, 9, 7, 7, 8, 9, 7, 8, 8...
## $ fun <dbl> 7, 8, 8, 7, 7, 4, 4, 6, 9, 8, 4, 6, 6, 6, 9, 3, 6, 5, 9, 7...
## $ amb <dbl> 6, 5, 5, 6, 6, 6, 6, 5, 8, 10, 6, 9, 3, 5, 7, 6, 7, 9, 4, ...
## $ shar <dbl> 5, 6, 7, 8, 6, 4, 7, 6, 8, 8, 3, 6, 4, 7, 8, 2, 9, 5, 5, 8...
## $ like <dbl> 7, 7, 7, 7, 6, 6, 6, 6, 7, 6, 6, 7, 6, 7, 8, 6, 8, 5, 5, 8...
## $ prob <dbl> 6, 5, NA, 6, 6, 5, 5, 7, 7, 6, 4, 3, 7, 8, 6, 5, 7, 6, 6, ...
## $ met <dbl> 2, 1, 1, 2, 2, 2, 2, NA, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, ...
## $ match_es <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
## $ attr1_s <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc1_s <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel1_s <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun1_s <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb1_s <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ shar1_s <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ attr3_s <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc3_s <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel3_s <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun3_s <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb3_s <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ satis_2 <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
## $ length <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ numdat_2 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, NA, NA, NA, NA, NA, NA, NA, ...
## $ attr7_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc7_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel7_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun7_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb7_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ shar7_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ attr1_2 <dbl> 19.44, 19.44, 19.44, 19.44, 19.44, 19.44, 19.44, 19.44, 19...
## $ sinc1_2 <dbl> 16.67, 16.67, 16.67, 16.67, 16.67, 16.67, 16.67, 16.67, 16...
## $ intel1_2 <dbl> 13.89, 13.89, 13.89, 13.89, 13.89, 13.89, 13.89, 13.89, 13...
## $ fun1_2 <dbl> 22.22, 22.22, 22.22, 22.22, 22.22, 22.22, 22.22, 22.22, 22...
## $ amb1_2 <dbl> 11.11, 11.11, 11.11, 11.11, 11.11, 11.11, 11.11, 11.11, 11...
## $ shar1_2 <dbl> 16.67, 16.67, 16.67, 16.67, 16.67, 16.67, 16.67, 16.67, 16...
## $ attr4_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc4_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel4_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun4_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb4_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ shar4_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ attr2_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc2_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel2_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun2_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb2_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ shar2_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ attr3_2 <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7...
## $ sinc3_2 <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6...
## $ intel3_2 <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8...
## $ fun3_2 <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9...
## $ amb3_2 <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4...
## $ attr5_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc5_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel5_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun5_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb5_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ you_call <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ them_cal <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ date_3 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ numdat_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ num_in_3 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ attr1_3 <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 30, 30, 30, 30, 30...
## $ sinc1_3 <dbl> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 5, 5, 5, 5, 5, 5, ...
## $ intel1_3 <dbl> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 40, 40, 40, 40, 40...
## $ fun1_3 <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15...
## $ amb1_3 <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 0, 0, 0, 0, 0, 0, ...
## $ shar1_3 <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 10, 10, 10, 10, 10...
## $ attr7_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc7_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel7_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun7_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb7_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ shar7_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ attr4_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc4_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel4_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun4_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb4_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ shar4_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ attr2_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc2_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel2_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun2_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb2_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ shar2_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ attr3_3 <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7...
## $ sinc3_3 <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6...
## $ intel3_3 <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9...
## $ fun3_3 <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9...
## $ amb3_3 <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4...
## $ attr5_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc5_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel5_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun5_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb5_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
Some columns should be renamed to more intelligible names
dados_speed_date_renomeado <- dados_speed_date %>%
rename(
unique_id_number = iid,
id_within_wave = id,
male = gender,
subject_within_gender = idg,
choice = condtn,
n_people_met_in_wave = round,
position_meeting = position,
position_started = positin1,
order_meeting = order,
partnet_id_within_wave = partner,
partner_unique_id_number = pid,
interests_correlation = int_corr,
same_race = samerace,
my_age = age,
partner_age = age_o,
partner_race = race_o,
partner_stated_pref_time0_attractive = pf_o_att,
partner_stated_pref_time0_sincere = pf_o_sin,
partner_stated_pref_time0_intelligent = pf_o_int,
partner_stated_pref_time0_fun = pf_o_fun,
partner_stated_pref_time0_ambitious = pf_o_amb,
partner_stated_pref_time0_shared_interests = pf_o_sha,
cod_field = field_cd,
importance_same_race = imprace,
importance_same_religion = imprelig,
place_from = from,
zipcode = zipcode,
income_zipcode = income,
frequency_date = date,
frequency_go_out = go_out,
career_macro = career_c,
happy_expec = exphappy,
n_expect_like_you = expnum,
i_liked_partner = dec,
partner_liked_me = dec_o,
i_found_partner__attractive = attr,
i_found_partner__sincere = sinc,
i_found_partner__intelligent = intel,
i_found_partner__fun = fun,
i_found_partner__ambitious = amb,
i_found_partner__interests = shar,
degree_i_liked_partner = like,
partner_found_me__attractive = attr_o,
partner_found_me__sincere = sinc_o,
partner_found_me__intelligent = intel_o,
partner_found_me__fun = fun_o,
partner_found_me__ambitious = amb_o,
partner_found_me__interests = shar_o,
probability_i_find_partner_liked_me = prob,
met_before = met,
n_matches_you_think = match_es,
satisfaction_with_partners = satis_2,
opinion_duration_of_date = length,
opinion_num_dates = numdat_2,
num_matches_you_called = you_call,
num_matches_called_you = them_cal,
have_you_dated = date_3
)
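Before moving on, a quick sanity check on the rename is cheap insurance (a sketch; `stopifnot()` aborts if anything slipped through):

```r
# Renaming must not change the number of columns...
stopifnot(ncol(dados_speed_date_renomeado) == ncol(dados_speed_date))

# ...and the old short names should no longer be present.
stopifnot(!any(c("iid", "pid", "dec_o") %in% names(dados_speed_date_renomeado)))
```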
glimpse(dados_speed_date_renomeado)

## Rows: 8,378
## Columns: 195
## $ unique_id_number <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ id_within_wave <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ male <dbl> 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ subject_within_gender <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ choice <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ wave <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ n_people_met_in_wave <dbl> 10, 10, 10, 10, 10, 10, ...
## $ position_meeting <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ position_started <lgl> NA, NA, NA, NA, NA, NA, ...
## $ order_meeting <dbl> 4, 3, 10, 5, 7, 6, 1, 2,...
## $ partnet_id_within_wave <dbl> 1, 2, 3, 4, 5, 6, 7, 8, ...
## $ partner_unique_id_number <dbl> 11, 12, 13, 14, 15, 16, ...
## $ match <dbl> 0, 0, 1, 1, 1, 0, 0, 0, ...
## $ interests_correlation <dbl> 0.14, 0.54, 0.16, 0.61, ...
## $ same_race <dbl> 0, 0, 1, 0, 0, 0, 0, 0, ...
## $ partner_age <dbl> 27, 22, 22, 23, 24, 25, ...
## $ partner_race <dbl> 2, 2, 4, 2, 3, 2, 2, 2, ...
## $ partner_stated_pref_time0_attractive <dbl> 35.00, 60.00, 19.00, 30....
## $ partner_stated_pref_time0_sincere <dbl> 20.00, 0.00, 18.00, 5.00...
## $ partner_stated_pref_time0_intelligent <dbl> 20.00, 0.00, 19.00, 15.0...
## $ partner_stated_pref_time0_fun <dbl> 20.00, 40.00, 18.00, 40....
## $ partner_stated_pref_time0_ambitious <dbl> 0.00, 0.00, 14.00, 5.00,...
## $ partner_stated_pref_time0_shared_interests <dbl> 5.00, 0.00, 12.00, 5.00,...
## $ partner_liked_me <dbl> 0, 0, 1, 1, 1, 1, 0, 0, ...
## $ partner_found_me__attractive <dbl> 6, 7, 10, 7, 8, 7, 3, 6,...
## $ partner_found_me__sincere <dbl> 8, 8, 10, 8, 7, 7, 6, 7,...
## $ partner_found_me__intelligent <dbl> 8, 10, 10, 9, 9, 8, 7, 5...
## $ partner_found_me__fun <dbl> 8, 7, 10, 8, 6, 8, 5, 6,...
## $ partner_found_me__ambitious <dbl> 8, 7, 10, 9, 9, 7, 8, 8,...
## $ partner_found_me__interests <dbl> 6, 5, 10, 8, 7, 7, 7, 6,...
## $ like_o <dbl> 7.0, 8.0, 10.0, 7.0, 8.0...
## $ prob_o <dbl> 4, 4, 10, 7, 6, 6, 1, 5,...
## $ met_o <dbl> 2, 2, 1, 2, 2, 2, 2, 2, ...
## $ my_age <dbl> 21, 21, 21, 21, 21, 21, ...
## $ field <chr> "Law", "Law", "Law", "La...
## $ cod_field <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ undergra <lgl> NA, NA, NA, NA, NA, NA, ...
## $ mn_sat <lgl> NA, NA, NA, NA, NA, NA, ...
## $ tuition <lgl> NA, NA, NA, NA, NA, NA, ...
## $ race <dbl> 4, 4, 4, 4, 4, 4, 4, 4, ...
## $ importance_same_race <dbl> 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ importance_same_religion <dbl> 4, 4, 4, 4, 4, 4, 4, 4, ...
## $ place_from <chr> "Chicago", "Chicago", "C...
## $ zipcode <dbl> 60521, 60521, 60521, 605...
## $ income_zipcode <dbl> 69487, 69487, 69487, 694...
## $ goal <dbl> 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ frequency_date <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ frequency_go_out <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ career <chr> "lawyer", "lawyer", "law...
## $ career_macro <dbl> NA, NA, NA, NA, NA, NA, ...
## $ sports <dbl> 9, 9, 9, 9, 9, 9, 9, 9, ...
## $ tvsports <dbl> 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ exercise <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ dining <dbl> 9, 9, 9, 9, 9, 9, 9, 9, ...
## $ museums <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ art <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ hiking <dbl> 5, 5, 5, 5, 5, 5, 5, 5, ...
## $ gaming <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ clubbing <dbl> 5, 5, 5, 5, 5, 5, 5, 5, ...
## $ reading <dbl> 6, 6, 6, 6, 6, 6, 6, 6, ...
## $ tv <dbl> 9, 9, 9, 9, 9, 9, 9, 9, ...
## $ theater <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ movies <dbl> 10, 10, 10, 10, 10, 10, ...
## $ concerts <dbl> 10, 10, 10, 10, 10, 10, ...
## $ music <dbl> 9, 9, 9, 9, 9, 9, 9, 9, ...
## $ shopping <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ yoga <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ happy_expec <dbl> 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ n_expect_like_you <dbl> 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ attr1_1 <dbl> 15, 15, 15, 15, 15, 15, ...
## $ sinc1_1 <dbl> 20, 20, 20, 20, 20, 20, ...
## $ intel1_1 <dbl> 20, 20, 20, 20, 20, 20, ...
## $ fun1_1 <dbl> 15, 15, 15, 15, 15, 15, ...
## $ amb1_1 <dbl> 15, 15, 15, 15, 15, 15, ...
## $ shar1_1 <dbl> 15, 15, 15, 15, 15, 15, ...
## $ attr4_1 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc4_1 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel4_1 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun4_1 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb4_1 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ shar4_1 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ attr2_1 <dbl> 35, 35, 35, 35, 35, 35, ...
## $ sinc2_1 <dbl> 20, 20, 20, 20, 20, 20, ...
## $ intel2_1 <dbl> 15, 15, 15, 15, 15, 15, ...
## $ fun2_1 <dbl> 20, 20, 20, 20, 20, 20, ...
## $ amb2_1 <dbl> 5, 5, 5, 5, 5, 5, 5, 5, ...
## $ shar2_1 <dbl> 5, 5, 5, 5, 5, 5, 5, 5, ...
## $ attr3_1 <dbl> 6, 6, 6, 6, 6, 6, 6, 6, ...
## $ sinc3_1 <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ fun3_1 <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ intel3_1 <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ amb3_1 <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ attr5_1 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc5_1 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel5_1 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun5_1 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb5_1 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ i_liked_partner <dbl> 1, 1, 1, 1, 1, 0, 1, 0, ...
## $ i_found_partner__attractive <dbl> 6, 7, 5, 7, 5, 4, 7, 4, ...
## $ i_found_partner__sincere <dbl> 9, 8, 8, 6, 6, 9, 6, 9, ...
## $ i_found_partner__intelligent <dbl> 7, 7, 9, 8, 7, 7, 7, 7, ...
## $ i_found_partner__fun <dbl> 7, 8, 8, 7, 7, 4, 4, 6, ...
## $ i_found_partner__ambitious <dbl> 6, 5, 5, 6, 6, 6, 6, 5, ...
## $ i_found_partner__interests <dbl> 5, 6, 7, 8, 6, 4, 7, 6, ...
## $ degree_i_liked_partner <dbl> 7, 7, 7, 7, 6, 6, 6, 6, ...
## $ probability_i_find_partner_liked_me <dbl> 6, 5, NA, 6, 6, 5, 5, 7,...
## $ met_before <dbl> 2, 1, 1, 2, 2, 2, 2, NA,...
## $ n_matches_you_think <dbl> 4, 4, 4, 4, 4, 4, 4, 4, ...
## $ attr1_s <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc1_s <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel1_s <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun1_s <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb1_s <lgl> NA, NA, NA, NA, NA, NA, ...
## $ shar1_s <lgl> NA, NA, NA, NA, NA, NA, ...
## $ attr3_s <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc3_s <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel3_s <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun3_s <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb3_s <lgl> NA, NA, NA, NA, NA, NA, ...
## $ satisfaction_with_partners <dbl> 6, 6, 6, 6, 6, 6, 6, 6, ...
## $ opinion_duration_of_date <dbl> 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ opinion_num_dates <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ attr7_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc7_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel7_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun7_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb7_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ shar7_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ attr1_2 <dbl> 19.44, 19.44, 19.44, 19....
## $ sinc1_2 <dbl> 16.67, 16.67, 16.67, 16....
## $ intel1_2 <dbl> 13.89, 13.89, 13.89, 13....
## $ fun1_2 <dbl> 22.22, 22.22, 22.22, 22....
## $ amb1_2 <dbl> 11.11, 11.11, 11.11, 11....
## $ shar1_2 <dbl> 16.67, 16.67, 16.67, 16....
## $ attr4_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc4_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel4_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun4_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb4_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ shar4_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ attr2_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc2_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel2_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun2_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb2_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ shar2_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ attr3_2 <dbl> 6, 6, 6, 6, 6, 6, 6, 6, ...
## $ sinc3_2 <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ intel3_2 <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ fun3_2 <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ amb3_2 <dbl> 6, 6, 6, 6, 6, 6, 6, 6, ...
## $ attr5_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc5_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel5_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun5_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb5_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ num_matches_you_called <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ num_matches_called_you <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ have_you_dated <dbl> 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ numdat_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ num_in_3 <dbl> NA, NA, NA, NA, NA, NA, ...
## $ attr1_3 <dbl> 15, 15, 15, 15, 15, 15, ...
## $ sinc1_3 <dbl> 20, 20, 20, 20, 20, 20, ...
## $ intel1_3 <dbl> 20, 20, 20, 20, 20, 20, ...
## $ fun1_3 <dbl> 15, 15, 15, 15, 15, 15, ...
## $ amb1_3 <dbl> 15, 15, 15, 15, 15, 15, ...
## $ shar1_3 <dbl> 15, 15, 15, 15, 15, 15, ...
## $ attr7_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc7_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel7_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun7_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb7_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ shar7_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ attr4_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc4_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel4_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun4_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb4_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ shar4_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ attr2_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc2_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel2_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun2_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb2_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ shar2_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ attr3_3 <dbl> 5, 5, 5, 5, 5, 5, 5, 5, ...
## $ sinc3_3 <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ intel3_3 <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ fun3_3 <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ amb3_3 <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ attr5_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc5_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel5_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun5_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb5_3 <lgl> NA, NA, NA, NA, NA, NA, ...
There are still columns with cryptic suffixes, such as 1_1
# Rewrites a column name such as "attr4_1": strips the given suffix,
# expands the characteristic abbreviation into a readable word, and
# prepends the meaning of the suffix.
adjust_column_feature <- function(x, suffix, meaning) {
  suffix_removed <- str_remove(string = x, pattern = suffix)
  type <- case_when(
    suffix_removed == "attr" ~ "attractive",
    suffix_removed == "sinc" ~ "sincere",
    suffix_removed == "intel" ~ "intelligent",
    suffix_removed == "fun" ~ "fun",
    suffix_removed == "amb" ~ "ambitious",
    suffix_removed == "shar" ~ "shared_interests"
  )
  str_glue("{meaning}_{type}")
}
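For example, applied to one column name (result traced by hand from the function above):

```r
adjust_column_feature("attr4_1", suffix = "4_1", meaning = "competitors_look_for_")
# → "competitors_look_for__attractive"
```

Since each `meaning` passed below already ends in an underscore and `str_glue()` inserts another, the results keep the double-underscore convention used in the manual renames (e.g. `i_found_partner__attractive`).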
dados_speed_date_rename_with <- dados_speed_date_renomeado %>%
rename_with(
.cols = matches("^(?:attr|sinc|intel|fun|amb|shar)4_1"),
.fn = ~adjust_column_feature(x = .x, suffix = "4_1", meaning = "competitors_look_for_")
) %>%
rename_with(
.cols = matches("^(?:attr|sinc|intel|fun|amb|shar)4_2"),
.fn = ~adjust_column_feature(x = .x, suffix = "4_2", meaning = "competitors_look_for_follow_up_")
) %>%
rename_with(
.cols = matches("^(?:attr|sinc|intel|fun|amb|shar)4_3"),
.fn = ~adjust_column_feature(x = .x, suffix = "4_3", meaning = "competitors_look_for_follow_up_weeks_")
) %>%
rename_with(
.cols = matches("^(?:attr|sinc|intel|fun|amb|shar)1_1"),
.fn = ~adjust_column_feature(x = .x, suffix = "1_1", meaning = "you_look_for_")
) %>%
rename_with(
.cols = matches("^(?:attr|sinc|intel|fun|amb|shar)1_s"),
.fn = ~adjust_column_feature(x = .x, suffix = "1_s", meaning = "you_look_for_half_way_")
) %>%
rename_with(
.cols = matches("^(?:attr|sinc|intel|fun|amb|shar)1_2"),
.fn = ~adjust_column_feature(x = .x, suffix = "1_2", meaning = "you_look_for_follow_up_")
) %>%
rename_with(
.cols = matches("^(?:attr|sinc|intel|fun|amb|shar)1_3"),
.fn = ~adjust_column_feature(x = .x, suffix = "1_3", meaning = "you_look_for_follow_up_weeks_")
) %>%
rename_with(
.cols = matches("^(?:attr|sinc|intel|fun|amb|shar)2_1"),
.fn = ~adjust_column_feature(x = .x, suffix = "2_1", meaning = "you_think_opposite_sex_look_for_")
) %>%
rename_with(
.cols = matches("^(?:attr|sinc|intel|fun|amb|shar)2_2"),
.fn = ~adjust_column_feature(x = .x, suffix = "2_2", meaning = "you_think_opposite_sex_look_for_follow_up_")
) %>%
rename_with(
.cols = matches("^(?:attr|sinc|intel|fun|amb|shar)2_3"),
.fn = ~adjust_column_feature(x = .x, suffix = "2_3", meaning = "you_think_opposite_sex_look_for_follow_up_weeks_")
) %>%
rename_with(
.cols = matches("^(?:attr|sinc|intel|fun|amb|shar)5_1"),
.fn = ~adjust_column_feature(x = .x, suffix = "5_1", meaning = "others_perceive_you_")
) %>%
rename_with(
.cols = matches("^(?:attr|sinc|intel|fun|amb|shar)5_2"),
.fn = ~adjust_column_feature(x = .x, suffix = "5_2", meaning = "others_perceive_you_follow_up_")
) %>%
rename_with(
.cols = matches("^(?:attr|sinc|intel|fun|amb|shar)5_3"),
.fn = ~adjust_column_feature(x = .x, suffix = "5_3", meaning = "others_perceive_you_follow_up_weeks_")
) %>%
rename_with(
.cols = matches("^(?:attr|sinc|intel|fun|amb|shar)3_1"),
.fn = ~adjust_column_feature(x = .x, suffix = "3_1", meaning = "you_perceive_yourself_")
) %>%
rename_with(
.cols = matches("^(?:attr|sinc|intel|fun|amb|shar)3_2"),
.fn = ~adjust_column_feature(x = .x, suffix = "3_2", meaning = "you_perceive_yourself_follow_up_")
) %>%
rename_with(
.cols = matches("^(?:attr|sinc|intel|fun|amb|shar)3_s"),
.fn = ~adjust_column_feature(x = .x, suffix = "3_s", meaning = "you_perceive_yourself_half_way_")
) %>%
rename_with(
.cols = matches("^(?:attr|sinc|intel|fun|amb|shar)3_3"),
.fn = ~adjust_column_feature(x = .x, suffix = "3_3", meaning = "you_perceive_yourself_follow_up_weeks_")
) %>%
rename_with(
.cols = matches("^(?:attr|sinc|intel|fun|amb|shar)7_2"),
.fn = ~adjust_column_feature(x = .x, suffix = "7_2", meaning = "actual_importance_")
) %>%
rename_with(
.cols = matches("^(?:attr|sinc|intel|fun|amb|shar)7_3"),
.fn = ~adjust_column_feature(x = .x, suffix = "7_3", meaning = "actual_importance_follow_up_weeks_")
) %>%
select(
-c(
undergra,
mn_sat,
tuition
)
)
glimpse(dados_speed_date_rename_with)
## Rows: 8,378
## Columns: 192
## $ unique_id_number <dbl> 1...
## $ id_within_wave <dbl> 1...
## $ male <dbl> 0...
## $ subject_within_gender <dbl> 1...
## $ choice <dbl> 1...
## $ wave <dbl> 1...
## $ n_people_met_in_wave <dbl> 1...
## $ position_meeting <dbl> 7...
## $ position_started <lgl> N...
## $ order_meeting <dbl> 4...
## $ partnet_id_within_wave <dbl> 1...
## $ partner_unique_id_number <dbl> 1...
## $ match <dbl> 0...
## $ interests_correlation <dbl> 0...
## $ same_race <dbl> 0...
## $ partner_age <dbl> 2...
## $ partner_race <dbl> 2...
## $ partner_stated_pref_time0_attractive <dbl> 3...
## $ partner_stated_pref_time0_sincere <dbl> 2...
## $ partner_stated_pref_time0_intelligent <dbl> 2...
## $ partner_stated_pref_time0_fun <dbl> 2...
## $ partner_stated_pref_time0_ambitious <dbl> 0...
## $ partner_stated_pref_time0_shared_interests <dbl> 5...
## $ partner_liked_me <dbl> 0...
## $ partner_found_me__attractive <dbl> 6...
## $ partner_found_me__sincere <dbl> 8...
## $ partner_found_me__intelligent <dbl> 8...
## $ partner_found_me__fun <dbl> 8...
## $ partner_found_me__ambitious <dbl> 8...
## $ partner_found_me__interests <dbl> 6...
## $ like_o <dbl> 7...
## $ prob_o <dbl> 4...
## $ met_o <dbl> 2...
## $ my_age <dbl> 2...
## $ field <chr> "...
## $ cod_field <dbl> 1...
## $ race <dbl> 4...
## $ importance_same_race <dbl> 2...
## $ importance_same_religion <dbl> 4...
## $ place_from <chr> "...
## $ zipcode <dbl> 6...
## $ income_zipcode <dbl> 6...
## $ goal <dbl> 2...
## $ frequency_date <dbl> 7...
## $ frequency_go_out <dbl> 1...
## $ career <chr> "...
## $ career_macro <dbl> N...
## $ sports <dbl> 9...
## $ tvsports <dbl> 2...
## $ exercise <dbl> 8...
## $ dining <dbl> 9...
## $ museums <dbl> 1...
## $ art <dbl> 1...
## $ hiking <dbl> 5...
## $ gaming <dbl> 1...
## $ clubbing <dbl> 5...
## $ reading <dbl> 6...
## $ tv <dbl> 9...
## $ theater <dbl> 1...
## $ movies <dbl> 1...
## $ concerts <dbl> 1...
## $ music <dbl> 9...
## $ shopping <dbl> 8...
## $ yoga <dbl> 1...
## $ happy_expec <dbl> 3...
## $ n_expect_like_you <dbl> 2...
## $ you_look_for__attractive <dbl> 1...
## $ you_look_for__sincere <dbl> 2...
## $ you_look_for__intelligent <dbl> 2...
## $ you_look_for__fun <dbl> 1...
## $ you_look_for__ambitious <dbl> 1...
## $ you_look_for__shared_interests <dbl> 1...
## $ competitors_look_for__attractive <lgl> N...
## $ competitors_look_for__sincere <lgl> N...
## $ competitors_look_for__intelligent <lgl> N...
## $ competitors_look_for__fun <lgl> N...
## $ competitors_look_for__ambitious <lgl> N...
## $ competitors_look_for__shared_interests <lgl> N...
## $ you_think_opposite_sex_look_for__attractive <dbl> 3...
## $ you_think_opposite_sex_look_for__sincere <dbl> 2...
## $ you_think_opposite_sex_look_for__intelligent <dbl> 1...
## $ you_think_opposite_sex_look_for__fun <dbl> 2...
## $ you_think_opposite_sex_look_for__ambitious <dbl> 5...
## $ you_think_opposite_sex_look_for__shared_interests <dbl> 5...
## $ you_perceive_yourself__attractive <dbl> 6...
## $ you_perceive_yourself__sincere <dbl> 8...
## $ you_perceive_yourself__fun <dbl> 8...
## $ you_perceive_yourself__intelligent <dbl> 8...
## $ you_perceive_yourself__ambitious <dbl> 7...
## $ others_perceive_you__attractive <lgl> N...
## $ others_perceive_you__sincere <lgl> N...
## $ others_perceive_you__intelligent <lgl> N...
## $ others_perceive_you__fun <lgl> N...
## $ others_perceive_you__ambitious <lgl> N...
## $ i_liked_partner <dbl> 1...
## $ i_found_partner__attractive <dbl> 6...
## $ i_found_partner__sincere <dbl> 9...
## $ i_found_partner__intelligent <dbl> 7...
## $ i_found_partner__fun <dbl> 7...
## $ i_found_partner__ambitious <dbl> 6...
## $ i_found_partner__interests <dbl> 5...
## $ degree_i_liked_partner <dbl> 7...
## $ probability_i_find_partner_liked_me <dbl> 6...
## $ met_before <dbl> 2...
## $ n_matches_you_think <dbl> 4...
## $ you_look_for_half_way__attractive <lgl> N...
## $ you_look_for_half_way__sincere <lgl> N...
## $ you_look_for_half_way__intelligent <lgl> N...
## $ you_look_for_half_way__fun <lgl> N...
## $ you_look_for_half_way__ambitious <lgl> N...
## $ you_look_for_half_way__shared_interests <lgl> N...
## $ you_perceive_yourself_half_way__attractive <lgl> N...
## $ you_perceive_yourself_half_way__sincere <lgl> N...
## $ you_perceive_yourself_half_way__intelligent <lgl> N...
## $ you_perceive_yourself_half_way__fun <lgl> N...
## $ you_perceive_yourself_half_way__ambitious <lgl> N...
## $ satisfaction_with_partners <dbl> 6...
## $ opinion_duration_of_date <dbl> 2...
## $ opinion_num_dates <dbl> 1...
## $ actual_importance__attractive <lgl> N...
## $ actual_importance__sincere <lgl> N...
## $ actual_importance__intelligent <lgl> N...
## $ actual_importance__fun <lgl> N...
## $ actual_importance__ambitious <lgl> N...
## $ actual_importance__shared_interests <lgl> N...
## $ you_look_for_follow_up__attractive <dbl> 1...
## $ you_look_for_follow_up__sincere <dbl> 1...
## $ you_look_for_follow_up__intelligent <dbl> 1...
## $ you_look_for_follow_up__fun <dbl> 2...
## $ you_look_for_follow_up__ambitious <dbl> 1...
## $ you_look_for_follow_up__shared_interests <dbl> 1...
## $ competitors_look_for_follow_up__attractive <lgl> N...
## $ competitors_look_for_follow_up__sincere <lgl> N...
## $ competitors_look_for_follow_up__intelligent <lgl> N...
## $ competitors_look_for_follow_up__fun <lgl> N...
## $ competitors_look_for_follow_up__ambitious <lgl> N...
## $ competitors_look_for_follow_up__shared_interests <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up__attractive <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up__sincere <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up__intelligent <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up__fun <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up__ambitious <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up__shared_interests <lgl> N...
## $ you_perceive_yourself_follow_up__attractive <dbl> 6...
## $ you_perceive_yourself_follow_up__sincere <dbl> 7...
## $ you_perceive_yourself_follow_up__intelligent <dbl> 8...
## $ you_perceive_yourself_follow_up__fun <dbl> 7...
## $ you_perceive_yourself_follow_up__ambitious <dbl> 6...
## $ others_perceive_you_follow_up__attractive <lgl> N...
## $ others_perceive_you_follow_up__sincere <lgl> N...
## $ others_perceive_you_follow_up__intelligent <lgl> N...
## $ others_perceive_you_follow_up__fun <lgl> N...
## $ others_perceive_you_follow_up__ambitious <lgl> N...
## $ num_matches_you_called <dbl> 1...
## $ num_matches_called_you <dbl> 1...
## $ have_you_dated <dbl> 0...
## $ numdat_3 <lgl> N...
## $ num_in_3 <dbl> N...
## $ you_look_for_follow_up_weeks__attractive <dbl> 1...
## $ you_look_for_follow_up_weeks__sincere <dbl> 2...
## $ you_look_for_follow_up_weeks__intelligent <dbl> 2...
## $ you_look_for_follow_up_weeks__fun <dbl> 1...
## $ you_look_for_follow_up_weeks__ambitious <dbl> 1...
## $ you_look_for_follow_up_weeks__shared_interests <dbl> 1...
## $ actual_importance_follow_up_weeks__attractive <lgl> N...
## $ actual_importance_follow_up_weeks__sincere <lgl> N...
## $ actual_importance_follow_up_weeks__intelligent <lgl> N...
## $ actual_importance_follow_up_weeks__fun <lgl> N...
## $ actual_importance_follow_up_weeks__ambitious <lgl> N...
## $ actual_importance_follow_up_weeks__shared_interests <lgl> N...
## $ competitors_look_for_follow_up_weeks__attractive <lgl> N...
## $ competitors_look_for_follow_up_weeks__sincere <lgl> N...
## $ competitors_look_for_follow_up_weeks__intelligent <lgl> N...
## $ competitors_look_for_follow_up_weeks__fun <lgl> N...
## $ competitors_look_for_follow_up_weeks__ambitious <lgl> N...
## $ competitors_look_for_follow_up_weeks__shared_interests <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up_weeks__attractive <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up_weeks__sincere <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up_weeks__intelligent <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up_weeks__fun <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up_weeks__ambitious <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up_weeks__shared_interests <lgl> N...
## $ you_perceive_yourself_follow_up_weeks__attractive <dbl> 5...
## $ you_perceive_yourself_follow_up_weeks__sincere <dbl> 7...
## $ you_perceive_yourself_follow_up_weeks__intelligent <dbl> 7...
## $ you_perceive_yourself_follow_up_weeks__fun <dbl> 7...
## $ you_perceive_yourself_follow_up_weeks__ambitious <dbl> 7...
## $ others_perceive_you_follow_up_weeks__attractive <lgl> N...
## $ others_perceive_you_follow_up_weeks__sincere <lgl> N...
## $ others_perceive_you_follow_up_weeks__intelligent <lgl> N...
## $ others_perceive_you_follow_up_weeks__fun <lgl> N...
## $ others_perceive_you_follow_up_weeks__ambitious <lgl> N...
Many attributes are coded numerically, which hampers interpretation, so they were converted to character vectors.
This kind of numeric coding is very common in off-the-shelf, point-and-click statistical analysis products.
Note that the frequency_date attribute was turned into a factor whose levels have a specific order. This will have implications later on.
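One consequence of an ordered factor is that its levels support comparison operators, which matters for later filtering and modeling. A minimal sketch of the recode-then-order pattern used below (hypothetical codes; assumes dplyr is attached):

```r
library(dplyr)

codes <- c(3, 1, 7)

freq <- case_when(
  codes == 1 ~ "Several a week",
  codes == 3 ~ "Once a week",
  codes == 7 ~ "Never"
) %>%
  factor(
    levels = c("Several a week", "Once a week", "Never"),
    ordered = TRUE
  )

freq < "Never"
# TRUE TRUE FALSE -- comparisons respect the declared level order
```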
dados_speed_date_fatores <- dados_speed_date_rename_with %>% mutate(
choice = if_else(choice == 1, "limited", "extensive") ,
field_factor = case_when(
cod_field == 1 ~ "Law",
cod_field == 2 ~ "Math",
cod_field == 3 ~ "Social Science, Psychologist" ,
cod_field == 4 ~ "Medical Science, Pharmaceuticals, and Bio Tech" ,
cod_field == 5 ~ "Engineering" ,
cod_field == 6 ~ "English/Creative Writing/ Journalism" ,
cod_field == 7 ~ "History/Religion/Philosophy" ,
cod_field == 8 ~ "Business/Econ/Finance" ,
cod_field == 9 ~ "Education, Academia" ,
cod_field == 10 ~ "Biological Sciences/Chemistry/Physics",
cod_field == 11 ~ "Social Work" ,
cod_field == 12 ~ "Undergrad/undecided" ,
cod_field == 13 ~ "Political Science/International Affairs" ,
cod_field == 14 ~ "Film",
cod_field == 15 ~ "Fine Arts/Arts Administration",
cod_field == 16 ~ "Languages",
cod_field == 17 ~ "Architecture",
cod_field == 18 ~ "Other"
),
race = case_when(
race == 1 ~ "Black",
race == 2 ~ "White",
race == 3 ~ "Latino",
race == 4 ~ "Asian" ,
race == 5 ~ "Native American" ,
race == 6 ~ "Others"
),
partner_race = case_when(
partner_race == 1 ~ "Black",
partner_race == 2 ~ "White",
partner_race == 3 ~ "Latino",
partner_race == 4 ~ "Asian" ,
partner_race == 5 ~ "Native American" ,
partner_race == 6 ~ "Others"
),
goal = case_when(
goal == 1 ~ "Fun",
goal == 2 ~ "Meet new people",
goal == 3 ~ "Date",
goal == 4 ~ "Serious",
goal == 5 ~ "To say",
goal == 6 ~ "Other"
),
cod_frequency_date = frequency_date,
frequency_date =
case_when(
frequency_date == 1 ~ "Several a week",
frequency_date == 2 ~ "Twice a week",
frequency_date == 3 ~ "Once a week",
frequency_date == 4 ~ "Twice a month",
frequency_date == 5 ~ "Once a month",
frequency_date == 6 ~ "Several a year",
frequency_date == 7 ~ "Never"
) %>%
factor(
levels = c(
"Several a week",
"Twice a week",
"Once a week",
"Twice a month",
"Once a month",
"Several a year",
"Never"
),
ordered = TRUE
),
frequency_go_out =
case_when(
frequency_go_out == 1 ~ "Several a week",
frequency_go_out == 2 ~ "Twice a week",
frequency_go_out == 3 ~ "Once a week",
frequency_go_out == 4 ~ "Twice a month",
frequency_go_out == 5 ~ "Once a month",
frequency_go_out == 6 ~ "Several a year",
frequency_go_out == 7 ~ "Never"
) %>%
factor(
levels = c(
"Several a week",
"Twice a week",
"Once a week",
"Twice a month",
"Once a month",
"Several a year",
"Never"
),
ordered = TRUE
),
career = str_to_title(career),
career_macro =
case_when(
career_macro == 1 ~ "Lawyer",
career_macro == 2 ~ "Academic/Research",
career_macro == 3 ~ "Psychologist" ,
career_macro == 4 ~ "Doctor/Medicine" ,
career_macro == 5 ~ "Engineer" ,
career_macro == 6 ~ "Creative Arts/Entertainment" ,
career_macro == 7 ~ "Banking/Consulting/Finance/Marketing/Business/CEO/Entrepreneur/Admin" ,
career_macro == 8 ~ "Real Estate" ,
career_macro == 9 ~ "International/Humanitarian Affairs" ,
career_macro == 10 ~ "Undecided" ,
career_macro == 11 ~ "Social Work",
career_macro == 12 ~ "Speech Pathology",
career_macro == 13 ~ "Politics",
career_macro == 14 ~ "Pro sports/Athletics",
career_macro == 15 ~ "Other",
career_macro == 16 ~ "Journalism",
career_macro == 17 ~ "Architecture"
),
met_before = if_else(met_before == 1, TRUE, FALSE),
opinion_duration_of_date = case_when(
opinion_duration_of_date == 1 ~ "Too little",
opinion_duration_of_date == 2 ~ "Too much",
opinion_duration_of_date == 3 ~ "Just Right"
),
opinion_num_dates = case_when(
opinion_num_dates == 1 ~ "Too few",
opinion_num_dates == 2 ~ "Too many"
),
have_you_dated = case_when(
have_you_dated == 1 ~ TRUE,
have_you_dated == 2 ~ FALSE
),
sex = if_else(male > 0, "Homem", "Mulher") %>% as_factor()
) %>%
select(
match,
unique_id_number,
id_within_wave,
sex,
subject_within_gender,
choice,
n_people_met_in_wave,
position_meeting,
position_started,
order_meeting,
partnet_id_within_wave,
partner_unique_id_number,
interests_correlation,
same_race,
my_age,
partner_age,
partner_race,
partner_stated_pref_time0_attractive,
partner_stated_pref_time0_sincere,
partner_stated_pref_time0_intelligent,
partner_stated_pref_time0_fun,
partner_stated_pref_time0_ambitious,
partner_stated_pref_time0_shared_interests,
importance_same_race,
importance_same_religion,
income_zipcode,
frequency_date,
frequency_go_out,
career_macro,
happy_expec,
n_expect_like_you,
partner_liked_me,
i_liked_partner,
i_found_partner__attractive,
i_found_partner__sincere,
i_found_partner__intelligent,
i_found_partner__fun,
i_found_partner__ambitious,
i_found_partner__interests,
partner_found_me__attractive,
partner_found_me__sincere,
partner_found_me__intelligent,
partner_found_me__fun,
partner_found_me__ambitious,
partner_found_me__interests,
probability_i_find_partner_liked_me,
met_before,
opinion_duration_of_date,
opinion_num_dates,
starts_with("competitors_look_for__"),
starts_with("you_look_for__"),
starts_with("you_think_opposite_sex_look_for__"),
starts_with("others_perceive_you__"),
starts_with("you_perceive_yourself__"),
starts_with("actual_importance__"),
race,
goal
) %>%
mutate(
across(
.cols = where(is.character),
.fns = as.factor
)
) %>%
mutate(
across(
.cols = c(match, same_race, partner_liked_me, i_liked_partner) ,
.fns = as.logical
)
)
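The two closing mutate/across calls convert every character column to a factor and the listed 0/1 columns to logicals; the pattern can be sketched in isolation (hypothetical data; assumes dplyr is attached):

```r
library(dplyr)

dates <- tibble(match = c(0, 1, NA), career = c("Law", "Film", "Law"))

dates_typed <- dates %>%
  mutate(
    across(where(is.character), as.factor),
    across(match, as.logical)
  )

dates_typed
# match is now FALSE TRUE NA; career is now a factor
```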
glimpse(dados_speed_date_fatores)
## Rows: 8,378
## Columns: 79
## $ match <lgl> FALSE, FALSE, TRUE, TRUE...
## $ unique_id_number <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ id_within_wave <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ sex <fct> Mulher, Mulher, Mulher, ...
## $ subject_within_gender <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ choice <fct> limited, limited, limite...
## $ n_people_met_in_wave <dbl> 10, 10, 10, 10, 10, 10, ...
## $ position_meeting <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ position_started <lgl> NA, NA, NA, NA, NA, NA, ...
## $ order_meeting <dbl> 4, 3, 10, 5, 7, 6, 1, 2,...
## $ partnet_id_within_wave <dbl> 1, 2, 3, 4, 5, 6, 7, 8, ...
## $ partner_unique_id_number <dbl> 11, 12, 13, 14, 15, 16, ...
## $ interests_correlation <dbl> 0.14, 0.54, 0.16, 0.61, ...
## $ same_race <lgl> FALSE, FALSE, TRUE, FALS...
## $ my_age <dbl> 21, 21, 21, 21, 21, 21, ...
## $ partner_age <dbl> 27, 22, 22, 23, 24, 25, ...
## $ partner_race <fct> White, White, Asian, Whi...
## $ partner_stated_pref_time0_attractive <dbl> 35.00, 60.00, 19.00, 30....
## $ partner_stated_pref_time0_sincere <dbl> 20.00, 0.00, 18.00, 5.00...
## $ partner_stated_pref_time0_intelligent <dbl> 20.00, 0.00, 19.00, 15.0...
## $ partner_stated_pref_time0_fun <dbl> 20.00, 40.00, 18.00, 40....
## $ partner_stated_pref_time0_ambitious <dbl> 0.00, 0.00, 14.00, 5.00,...
## $ partner_stated_pref_time0_shared_interests <dbl> 5.00, 0.00, 12.00, 5.00,...
## $ importance_same_race <dbl> 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ importance_same_religion <dbl> 4, 4, 4, 4, 4, 4, 4, 4, ...
## $ income_zipcode <dbl> 69487, 69487, 69487, 694...
## $ frequency_date <ord> Never, Never, Never, Nev...
## $ frequency_go_out <ord> Several a week, Several ...
## $ career_macro <fct> NA, NA, NA, NA, NA, NA, ...
## $ happy_expec <dbl> 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ n_expect_like_you <dbl> 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ partner_liked_me <lgl> FALSE, FALSE, TRUE, TRUE...
## $ i_liked_partner <lgl> TRUE, TRUE, TRUE, TRUE, ...
## $ i_found_partner__attractive <dbl> 6, 7, 5, 7, 5, 4, 7, 4, ...
## $ i_found_partner__sincere <dbl> 9, 8, 8, 6, 6, 9, 6, 9, ...
## $ i_found_partner__intelligent <dbl> 7, 7, 9, 8, 7, 7, 7, 7, ...
## $ i_found_partner__fun <dbl> 7, 8, 8, 7, 7, 4, 4, 6, ...
## $ i_found_partner__ambitious <dbl> 6, 5, 5, 6, 6, 6, 6, 5, ...
## $ i_found_partner__interests <dbl> 5, 6, 7, 8, 6, 4, 7, 6, ...
## $ partner_found_me__attractive <dbl> 6, 7, 10, 7, 8, 7, 3, 6,...
## $ partner_found_me__sincere <dbl> 8, 8, 10, 8, 7, 7, 6, 7,...
## $ partner_found_me__intelligent <dbl> 8, 10, 10, 9, 9, 8, 7, 5...
## $ partner_found_me__fun <dbl> 8, 7, 10, 8, 6, 8, 5, 6,...
## $ partner_found_me__ambitious <dbl> 8, 7, 10, 9, 9, 7, 8, 8,...
## $ partner_found_me__interests <dbl> 6, 5, 10, 8, 7, 7, 7, 6,...
## $ probability_i_find_partner_liked_me <dbl> 6, 5, NA, 6, 6, 5, 5, 7,...
## $ met_before <lgl> FALSE, TRUE, TRUE, FALSE...
## $ opinion_duration_of_date <fct> Too much, Too much, Too ...
## $ opinion_num_dates <fct> Too few, Too few, Too fe...
## $ competitors_look_for__attractive <lgl> NA, NA, NA, NA, NA, NA, ...
## $ competitors_look_for__sincere <lgl> NA, NA, NA, NA, NA, NA, ...
## $ competitors_look_for__intelligent <lgl> NA, NA, NA, NA, NA, NA, ...
## $ competitors_look_for__fun <lgl> NA, NA, NA, NA, NA, NA, ...
## $ competitors_look_for__ambitious <lgl> NA, NA, NA, NA, NA, NA, ...
## $ competitors_look_for__shared_interests <lgl> NA, NA, NA, NA, NA, NA, ...
## $ you_look_for__attractive <dbl> 15, 15, 15, 15, 15, 15, ...
## $ you_look_for__sincere <dbl> 20, 20, 20, 20, 20, 20, ...
## $ you_look_for__intelligent <dbl> 20, 20, 20, 20, 20, 20, ...
## $ you_look_for__fun <dbl> 15, 15, 15, 15, 15, 15, ...
## $ you_look_for__ambitious <dbl> 15, 15, 15, 15, 15, 15, ...
## $ you_look_for__shared_interests <dbl> 15, 15, 15, 15, 15, 15, ...
## $ others_perceive_you__attractive <lgl> NA, NA, NA, NA, NA, NA, ...
## $ others_perceive_you__sincere <lgl> NA, NA, NA, NA, NA, NA, ...
## $ others_perceive_you__intelligent <lgl> NA, NA, NA, NA, NA, NA, ...
## $ others_perceive_you__fun <lgl> NA, NA, NA, NA, NA, NA, ...
## $ others_perceive_you__ambitious <lgl> NA, NA, NA, NA, NA, NA, ...
## $ you_perceive_yourself__attractive <dbl> 6, 6, 6, 6, 6, 6, 6, 6, ...
## $ you_perceive_yourself__sincere <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ you_perceive_yourself__fun <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ you_perceive_yourself__intelligent <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ you_perceive_yourself__ambitious <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ actual_importance__attractive <lgl> NA, NA, NA, NA, NA, NA, ...
## $ actual_importance__sincere <lgl> NA, NA, NA, NA, NA, NA, ...
## $ actual_importance__intelligent <lgl> NA, NA, NA, NA, NA, NA, ...
## $ actual_importance__fun <lgl> NA, NA, NA, NA, NA, NA, ...
## $ actual_importance__ambitious <lgl> NA, NA, NA, NA, NA, NA, ...
## $ actual_importance__shared_interests <lgl> NA, NA, NA, NA, NA, NA, ...
## $ race <fct> Asian, Asian, Asian, Asi...
## $ goal <fct> Meet new people, Meet ne...
Some questions were asked inconsistently across the days of the study.
On some days, participants were given a budget of x points to distribute across the attributes of a given type; on other days, each attribute had its own budget.
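Row-wise normalization makes the two formats comparable: each block of same-prefix columns is rescaled so it sums to 1 per respondent. A sketch on toy data (hypothetical values and a hypothetical `pref_` prefix; assumes dplyr is attached):

```r
library(dplyr)

# two respondents allocating points on different budgets (hypothetical)
prefs <- tibble(pref_attr = c(35, 60), pref_fun = c(15, 40), wave = 1:2)

prefs_norm <- prefs %>%
  rowwise() %>%
  mutate(total = sum(c_across(starts_with("pref_")), na.rm = TRUE)) %>%
  mutate(across(starts_with("pref_"), ~ .x / total)) %>%
  ungroup() %>%
  select(-total)

prefs_norm
# pref_attr: 0.7 0.6; pref_fun: 0.3 0.4 -- each row's prefs sum to 1
```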
normaliza_no_prefixo <- function(
df = dados_com_representacao ,
prefixo = "partner_stated_pref_time0_" ){
df %>%
rowwise() %>%
mutate(
"{prefixo}_soma" :=
sum(c_across(starts_with(prefixo)), na.rm = TRUE)
) %>%
mutate(
across(
.cols = starts_with(prefixo),
.fns = ~.x / .data[[str_glue("{prefixo}_soma")]]
)
) %>%
select(
-contains(str_glue("{prefixo}_soma"))
)
}
dados_speed_date_normalizada <- dados_speed_date_fatores %>%
normaliza_no_prefixo("partner_stated_pref_time0_" ) %>%
normaliza_no_prefixo("you_look_for__" ) %>%
normaliza_no_prefixo("you_think_opposite_sex_look_for__") %>%
ungroup()
glimpse(dados_speed_date_normalizada)
## Rows: 8,378
## Columns: 79
## $ match <lgl> FALSE, FALSE, TRUE, TRUE...
## $ unique_id_number <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ id_within_wave <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ sex <fct> Mulher, Mulher, Mulher, ...
## $ subject_within_gender <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ choice <fct> limited, limited, limite...
## $ n_people_met_in_wave <dbl> 10, 10, 10, 10, 10, 10, ...
## $ position_meeting <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ position_started <lgl> NA, NA, NA, NA, NA, NA, ...
## $ order_meeting <dbl> 4, 3, 10, 5, 7, 6, 1, 2,...
## $ partnet_id_within_wave <dbl> 1, 2, 3, 4, 5, 6, 7, 8, ...
## $ partner_unique_id_number <dbl> 11, 12, 13, 14, 15, 16, ...
## $ interests_correlation <dbl> 0.14, 0.54, 0.16, 0.61, ...
## $ same_race <lgl> FALSE, FALSE, TRUE, FALS...
## $ my_age <dbl> 21, 21, 21, 21, 21, 21, ...
## $ partner_age <dbl> 27, 22, 22, 23, 24, 25, ...
## $ partner_race <fct> White, White, Asian, Whi...
## $ partner_stated_pref_time0_attractive <dbl> 0.3500000, 0.6000000, 0....
## $ partner_stated_pref_time0_sincere <dbl> 0.2000000, 0.0000000, 0....
## $ partner_stated_pref_time0_intelligent <dbl> 0.2000000, 0.0000000, 0....
## $ partner_stated_pref_time0_fun <dbl> 0.2000000, 0.4000000, 0....
## $ partner_stated_pref_time0_ambitious <dbl> 0.0000000, 0.0000000, 0....
## $ partner_stated_pref_time0_shared_interests <dbl> 0.0500000, 0.0000000, 0....
## $ importance_same_race <dbl> 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ importance_same_religion <dbl> 4, 4, 4, 4, 4, 4, 4, 4, ...
## $ income_zipcode <dbl> 69487, 69487, 69487, 694...
## $ frequency_date <ord> Never, Never, Never, Nev...
## $ frequency_go_out <ord> Several a week, Several ...
## $ career_macro <fct> NA, NA, NA, NA, NA, NA, ...
## $ happy_expec <dbl> 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ n_expect_like_you <dbl> 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ partner_liked_me <lgl> FALSE, FALSE, TRUE, TRUE...
## $ i_liked_partner <lgl> TRUE, TRUE, TRUE, TRUE, ...
## $ i_found_partner__attractive <dbl> 6, 7, 5, 7, 5, 4, 7, 4, ...
## $ i_found_partner__sincere <dbl> 9, 8, 8, 6, 6, 9, 6, 9, ...
## $ i_found_partner__intelligent <dbl> 7, 7, 9, 8, 7, 7, 7, 7, ...
## $ i_found_partner__fun <dbl> 7, 8, 8, 7, 7, 4, 4, 6, ...
## $ i_found_partner__ambitious <dbl> 6, 5, 5, 6, 6, 6, 6, 5, ...
## $ i_found_partner__interests <dbl> 5, 6, 7, 8, 6, 4, 7, 6, ...
## $ partner_found_me__attractive <dbl> 6, 7, 10, 7, 8, 7, 3, 6,...
## $ partner_found_me__sincere <dbl> 8, 8, 10, 8, 7, 7, 6, 7,...
## $ partner_found_me__intelligent <dbl> 8, 10, 10, 9, 9, 8, 7, 5...
## $ partner_found_me__fun <dbl> 8, 7, 10, 8, 6, 8, 5, 6,...
## $ partner_found_me__ambitious <dbl> 8, 7, 10, 9, 9, 7, 8, 8,...
## $ partner_found_me__interests <dbl> 6, 5, 10, 8, 7, 7, 7, 6,...
## $ probability_i_find_partner_liked_me <dbl> 6, 5, NA, 6, 6, 5, 5, 7,...
## $ met_before <lgl> FALSE, TRUE, TRUE, FALSE...
## $ opinion_duration_of_date <fct> Too much, Too much, Too ...
## $ opinion_num_dates <fct> Too few, Too few, Too fe...
## $ competitors_look_for__attractive <lgl> NA, NA, NA, NA, NA, NA, ...
## $ competitors_look_for__sincere <lgl> NA, NA, NA, NA, NA, NA, ...
## $ competitors_look_for__intelligent <lgl> NA, NA, NA, NA, NA, NA, ...
## $ competitors_look_for__fun <lgl> NA, NA, NA, NA, NA, NA, ...
## $ competitors_look_for__ambitious <lgl> NA, NA, NA, NA, NA, NA, ...
## $ competitors_look_for__shared_interests <lgl> NA, NA, NA, NA, NA, NA, ...
## $ you_look_for__attractive <dbl> 0.15, 0.15, 0.15, 0.15, ...
## $ you_look_for__sincere <dbl> 0.20, 0.20, 0.20, 0.20, ...
## $ you_look_for__intelligent <dbl> 0.20, 0.20, 0.20, 0.20, ...
## $ you_look_for__fun <dbl> 0.15, 0.15, 0.15, 0.15, ...
## $ you_look_for__ambitious <dbl> 0.15, 0.15, 0.15, 0.15, ...
## $ you_look_for__shared_interests <dbl> 0.15, 0.15, 0.15, 0.15, ...
## $ others_perceive_you__attractive <lgl> NA, NA, NA, NA, NA, NA, ...
## $ others_perceive_you__sincere <lgl> NA, NA, NA, NA, NA, NA, ...
## $ others_perceive_you__intelligent <lgl> NA, NA, NA, NA, NA, NA, ...
## $ others_perceive_you__fun <lgl> NA, NA, NA, NA, NA, NA, ...
## $ others_perceive_you__ambitious <lgl> NA, NA, NA, NA, NA, NA, ...
## $ you_perceive_yourself__attractive <dbl> 6, 6, 6, 6, 6, 6, 6, 6, ...
## $ you_perceive_yourself__sincere <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ you_perceive_yourself__fun <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ you_perceive_yourself__intelligent <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ you_perceive_yourself__ambitious <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ actual_importance__attractive <lgl> NA, NA, NA, NA, NA, NA, ...
## $ actual_importance__sincere <lgl> NA, NA, NA, NA, NA, NA, ...
## $ actual_importance__intelligent <lgl> NA, NA, NA, NA, NA, NA, ...
## $ actual_importance__fun <lgl> NA, NA, NA, NA, NA, NA, ...
## $ actual_importance__ambitious <lgl> NA, NA, NA, NA, NA, NA, ...
## $ actual_importance__shared_interests <lgl> NA, NA, NA, NA, NA, NA, ...
## $ race <fct> Asian, Asian, Asian, Asi...
## $ goal <fct> Meet new people, Meet ne...
The skimr library, through its skim() function, offers a good way to see a summary of the data's characteristics:
| Name | dados_speed_date_normaliz… |
|---|---|
| Number of rows | 8378 |
| Number of columns | 79 |
| Column type frequency: | |
| factor | 10 |
| logical | 23 |
| numeric | 46 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| sex | 0 | 1.00 | FALSE | 2 | Hom: 4194, Mul: 4184 |
| choice | 0 | 1.00 | FALSE | 2 | ext: 6944, lim: 1434 |
| partner_race | 73 | 0.99 | FALSE | 5 | Whi: 4722, Asi: 1978, Lat: 664, Oth: 521 |
| frequency_date | 97 | 0.99 | TRUE | 7 | Sev: 2094, Twi: 2040, Onc: 1528, Nev: 1434 |
| frequency_go_out | 2778 | 0.67 | TRUE | 2 | Twi: 2990, Sev: 2610, Onc: 0, Twi: 0 |
| career_macro | 138 | 0.98 | FALSE | 17 | Aca: 2320, Ban: 2170, Cre: 724, Law: 675 |
| opinion_duration_of_date | 915 | 0.89 | FALSE | 3 | Too: 4227, Jus: 3059, Too: 177 |
| opinion_num_dates | 4107 | 0.51 | FALSE | 2 | Too: 3622, Too: 649 |
| race | 63 | 0.99 | FALSE | 5 | Whi: 4727, Asi: 1982, Lat: 664, Oth: 522 |
| goal | 79 | 0.99 | FALSE | 6 | Fun: 3426, Mee: 3012, Dat: 631, To : 510 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| match | 0 | 1.00 | 0.16 | FAL: 6998, TRU: 1380 |
| position_started | 7974 | 0.05 | 1.00 | TRU: 404 |
| same_race | 0 | 1.00 | 0.40 | FAL: 5062, TRU: 3316 |
| partner_liked_me | 0 | 1.00 | 0.42 | FAL: 4863, TRU: 3515 |
| i_liked_partner | 0 | 1.00 | 0.42 | FAL: 4860, TRU: 3518 |
| met_before | 375 | 0.96 | 0.04 | FAL: 7652, TRU: 351 |
| competitors_look_for__attractive | 8378 | 0.00 | NaN | : |
| competitors_look_for__sincere | 7997 | 0.05 | 0.05 | FAL: 363, TRU: 18 |
| competitors_look_for__intelligent | 8204 | 0.02 | 0.28 | FAL: 125, TRU: 49 |
| competitors_look_for__fun | 8319 | 0.01 | 0.31 | FAL: 41, TRU: 18 |
| competitors_look_for__ambitious | 7693 | 0.08 | 0.18 | FAL: 563, TRU: 122 |
| competitors_look_for__shared_interests | 8059 | 0.04 | 0.15 | FAL: 271, TRU: 48 |
| others_perceive_you__attractive | 8378 | 0.00 | NaN | : |
| others_perceive_you__sincere | 8368 | 0.00 | 1.00 | TRU: 10 |
| others_perceive_you__intelligent | 8378 | 0.00 | NaN | : |
| others_perceive_you__fun | 8378 | 0.00 | NaN | : |
| others_perceive_you__ambitious | 8363 | 0.00 | 1.00 | TRU: 15 |
| actual_importance__attractive | 8378 | 0.00 | NaN | : |
| actual_importance__sincere | 8205 | 0.02 | 0.00 | FAL: 173 |
| actual_importance__intelligent | 8297 | 0.01 | 0.00 | FAL: 81 |
| actual_importance__fun | 8344 | 0.00 | 0.00 | FAL: 34 |
| actual_importance__ambitious | 7842 | 0.06 | 0.00 | FAL: 536 |
| actual_importance__shared_interests | 8165 | 0.03 | 0.00 | FAL: 213 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| unique_id_number | 0 | 1.00 | 283.68 | 158.58 | 1.00 | 154.00 | 281.00 | 407.00 | 552.00 | ▇▇▇▇▇ |
| id_within_wave | 1 | 1.00 | 8.96 | 5.49 | 1.00 | 4.00 | 8.00 | 13.00 | 22.00 | ▇▆▅▃▂ |
| subject_within_gender | 0 | 1.00 | 17.33 | 10.94 | 1.00 | 8.00 | 16.00 | 26.00 | 44.00 | ▇▇▅▅▂ |
| n_people_met_in_wave | 0 | 1.00 | 16.87 | 4.36 | 5.00 | 14.00 | 18.00 | 20.00 | 22.00 | ▁▃▂▅▇ |
| position_meeting | 0 | 1.00 | 9.04 | 5.51 | 1.00 | 4.00 | 8.00 | 13.00 | 22.00 | ▇▆▅▃▂ |
| order_meeting | 0 | 1.00 | 8.93 | 5.48 | 1.00 | 4.00 | 8.00 | 13.00 | 22.00 | ▇▆▅▃▂ |
| partnet_id_within_wave | 0 | 1.00 | 8.96 | 5.49 | 1.00 | 4.00 | 8.00 | 13.00 | 22.00 | ▇▆▅▃▂ |
| partner_unique_id_number | 10 | 1.00 | 283.86 | 158.58 | 1.00 | 154.00 | 281.00 | 408.00 | 552.00 | ▇▇▇▇▇ |
| interests_correlation | 158 | 0.98 | 0.20 | 0.30 | -0.83 | -0.02 | 0.21 | 0.43 | 0.91 | ▁▃▇▇▂ |
| my_age | 95 | 0.99 | 26.36 | 3.57 | 18.00 | 24.00 | 26.00 | 28.00 | 55.00 | ▇▇▁▁▁ |
| partner_age | 104 | 0.99 | 26.36 | 3.56 | 18.00 | 24.00 | 26.00 | 28.00 | 55.00 | ▇▇▁▁▁ |
| partner_stated_pref_time0_attractive | 89 | 0.99 | 0.22 | 0.13 | 0.00 | 0.15 | 0.20 | 0.25 | 1.00 | ▇▃▁▁▁ |
| partner_stated_pref_time0_sincere | 89 | 0.99 | 0.17 | 0.07 | 0.00 | 0.15 | 0.18 | 0.20 | 0.60 | ▃▇▂▁▁ |
| partner_stated_pref_time0_intelligent | 89 | 0.99 | 0.20 | 0.07 | 0.00 | 0.17 | 0.20 | 0.23 | 0.50 | ▂▇▃▁▁ |
| partner_stated_pref_time0_fun | 98 | 0.99 | 0.17 | 0.06 | 0.00 | 0.15 | 0.18 | 0.20 | 0.50 | ▂▇▂▁▁ |
| partner_stated_pref_time0_ambitious | 107 | 0.99 | 0.11 | 0.06 | 0.00 | 0.05 | 0.10 | 0.15 | 0.36 | ▇▇▇▁▁ |
| partner_stated_pref_time0_shared_interests | 129 | 0.98 | 0.12 | 0.06 | 0.00 | 0.10 | 0.11 | 0.16 | 0.30 | ▆▇▇▃▁ |
| importance_same_race | 79 | 0.99 | 3.78 | 2.85 | 0.00 | 1.00 | 3.00 | 6.00 | 10.00 | ▇▃▂▂▂ |
| importance_same_religion | 79 | 0.99 | 3.65 | 2.81 | 1.00 | 1.00 | 3.00 | 6.00 | 10.00 | ▇▃▃▂▁ |
| income_zipcode | 4099 | 0.51 | 44887.61 | 17206.92 | 8607.00 | 31516.00 | 43185.00 | 54303.00 | 109031.00 | ▃▇▅▂▁ |
| happy_expec | 101 | 0.99 | 5.53 | 1.73 | 1.00 | 5.00 | 6.00 | 7.00 | 10.00 | ▁▃▇▃▁ |
| n_expect_like_you | 6578 | 0.21 | 5.57 | 4.76 | 0.00 | 2.00 | 4.00 | 8.00 | 20.00 | ▇▃▂▁▁ |
| i_found_partner__attractive | 202 | 0.98 | 6.19 | 1.95 | 0.00 | 5.00 | 6.00 | 8.00 | 10.00 | ▁▃▇▇▂ |
| i_found_partner__sincere | 277 | 0.97 | 7.18 | 1.74 | 0.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▁▃▇▃ |
| i_found_partner__intelligent | 296 | 0.96 | 7.37 | 1.55 | 0.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▁▃▇▃ |
| i_found_partner__fun | 350 | 0.96 | 6.40 | 1.95 | 0.00 | 5.00 | 7.00 | 8.00 | 10.00 | ▁▂▇▇▂ |
| i_found_partner__ambitious | 712 | 0.92 | 6.78 | 1.79 | 0.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▂▆▇▃ |
| i_found_partner__interests | 1067 | 0.87 | 5.47 | 2.16 | 0.00 | 4.00 | 6.00 | 7.00 | 10.00 | ▂▅▇▆▂ |
| partner_found_me__attractive | 212 | 0.97 | 6.19 | 1.95 | 0.00 | 5.00 | 6.00 | 8.00 | 10.50 | ▁▃▇▇▂ |
| partner_found_me__sincere | 287 | 0.97 | 7.18 | 1.74 | 0.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▁▃▇▃ |
| partner_found_me__intelligent | 306 | 0.96 | 7.37 | 1.55 | 0.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▁▃▇▃ |
| partner_found_me__fun | 360 | 0.96 | 6.40 | 1.95 | 0.00 | 5.00 | 7.00 | 8.00 | 11.00 | ▁▂▇▇▂ |
| partner_found_me__ambitious | 722 | 0.91 | 6.78 | 1.79 | 0.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▂▆▇▃ |
| partner_found_me__interests | 1076 | 0.87 | 5.47 | 2.16 | 0.00 | 4.00 | 6.00 | 7.00 | 10.00 | ▂▅▇▆▂ |
| probability_i_find_partner_liked_me | 309 | 0.96 | 5.21 | 2.13 | 0.00 | 4.00 | 5.00 | 7.00 | 10.00 | ▂▅▇▅▁ |
| you_look_for__attractive | 79 | 0.99 | 0.23 | 0.13 | 0.00 | 0.15 | 0.20 | 0.25 | 1.00 | ▇▃▁▁▁ |
| you_look_for__sincere | 79 | 0.99 | 0.17 | 0.07 | 0.00 | 0.15 | 0.18 | 0.20 | 0.60 | ▃▇▂▁▁ |
| you_look_for__intelligent | 79 | 0.99 | 0.20 | 0.07 | 0.00 | 0.17 | 0.20 | 0.23 | 0.50 | ▂▇▃▁▁ |
| you_look_for__fun | 89 | 0.99 | 0.17 | 0.06 | 0.00 | 0.15 | 0.18 | 0.20 | 0.50 | ▂▇▂▁▁ |
| you_look_for__ambitious | 99 | 0.99 | 0.11 | 0.06 | 0.00 | 0.05 | 0.10 | 0.15 | 0.36 | ▇▇▇▁▁ |
| you_look_for__shared_interests | 121 | 0.99 | 0.12 | 0.06 | 0.00 | 0.10 | 0.11 | 0.16 | 0.30 | ▆▇▇▃▁ |
| you_perceive_yourself__attractive | 105 | 0.99 | 7.08 | 1.40 | 2.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▂▂▇▂ |
| you_perceive_yourself__sincere | 105 | 0.99 | 8.29 | 1.41 | 2.00 | 8.00 | 8.00 | 9.00 | 10.00 | ▁▁▁▆▇ |
| you_perceive_yourself__fun | 105 | 0.99 | 7.70 | 1.56 | 2.00 | 7.00 | 8.00 | 9.00 | 10.00 | ▁▁▂▇▆ |
| you_perceive_yourself__intelligent | 105 | 0.99 | 8.40 | 1.08 | 3.00 | 8.00 | 8.00 | 9.00 | 10.00 | ▁▁▃▆▇ |
| you_perceive_yourself__ambitious | 105 | 0.99 | 7.58 | 1.78 | 2.00 | 7.00 | 8.00 | 9.00 | 10.00 | ▁▂▂▇▆ |
The skim() function returns a tibble, which can be used to extract statistics about the dataset:
## Rows: 79
## Columns: 17
## $ skim_type <chr> "factor", "factor", "factor", "factor", "factor",...
## $ skim_variable <chr> "sex", "choice", "partner_race", "frequency_date"...
## $ n_missing <int> 0, 0, 73, 97, 2778, 138, 915, 4107, 63, 79, 0, 79...
## $ complete_rate <dbl> 1.000000000, 1.000000000, 0.991286703, 0.98842205...
## $ factor.ordered <lgl> FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FA...
## $ factor.n_unique <int> 2, 2, 5, 7, 2, 17, 3, 2, 5, 6, NA, NA, NA, NA, NA...
## $ factor.top_counts <chr> "Hom: 4194, Mul: 4184", "ext: 6944, lim: 1434", "...
## $ logical.mean <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.1647171...
## $ logical.count <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "FAL: 699...
## $ numeric.mean <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ numeric.sd <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ numeric.p0 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ numeric.p25 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ numeric.p50 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ numeric.p75 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ numeric.p100 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ numeric.hist <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
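Because the result is a regular tibble, we can manipulate it with dplyr. A minimal sketch, assuming the skim() result above was stored in an object named `resumo` (a hypothetical name, not shown in this excerpt):

```r
library(dplyr)

# Hypothetical: `resumo` holds the tibble returned by skim()
resumo %>%
  select(skim_type, skim_variable, n_missing, complete_rate) %>%
  arrange(complete_rate) %>%  # least complete fields first
  head(10)
```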
We can see that many fields are nearly complete, while a few are much more sparsely filled.
In general, the sparse fields come from a survey answered weeks after the event.
We therefore drop the fields with low representation.
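One way to do this is to keep only the columns above a completeness cutoff. A sketch, assuming a hypothetical 90% threshold (the chunk that actually builds `dados_com_representacao` is not shown in this excerpt):

```r
library(dplyr)

# Hypothetical 90% completeness cutoff: keep columns with few missing values
dados_com_representacao <- dados_speed_date_normalizada %>%
  select(where(~ mean(!is.na(.x)) >= 0.9))
```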
We also want some of the partner's impressions in our dataset, so we build the final summary before we start playing with the data.
dados_speed_date_partner_side <- dados_speed_date_normalizada %>%
select(
unique_id_number,
partner_unique_id_number,
probability_partner_find_i_liked_partner = probability_i_find_partner_liked_me,
partner_career_macro = career_macro,
starts_with("you_perceive_yourself__")
) %>%
rename_with(
.cols = starts_with("you_perceive_yourself__"),
.fn = ~str_replace(.x, "you_perceive_yourself__", "partner_perceives_himself__")
)
dados_finais <- dados_com_representacao %>%
left_join(
dados_speed_date_partner_side,
by = c("unique_id_number" = "partner_unique_id_number", "partner_unique_id_number" = "unique_id_number" )
) %>%
filter(
across(
.cols = everything(),
.fns = ~!is.na(.x)
)
) %>%
mutate(
across(
.cols = where(is.logical) ,
.fns = as.numeric
)
)
resumo_com_representacao <- skim(dados_finais)
resumo_com_representacao
| Name | dados_finais |
| Number of rows | 4885 |
| Number of columns | 64 |
| _______________________ | |
| Column type frequency: | |
| factor | 9 |
| numeric | 55 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| sex | 0 | 1 | FALSE | 2 | Hom: 2456, Mul: 2429 |
| choice | 0 | 1 | FALSE | 2 | ext: 4099, lim: 786 |
| partner_race | 0 | 1 | FALSE | 5 | Whi: 2683, Asi: 1210, Lat: 386, Oth: 357 |
| frequency_date | 0 | 1 | TRUE | 7 | Twi: 1301, Sev: 1242, Onc: 910, Nev: 793 |
| career_macro | 0 | 1 | FALSE | 17 | Aca: 1472, Ban: 1236, Cre: 419, Law: 401 |
| opinion_duration_of_date | 0 | 1 | FALSE | 3 | Too: 2786, Jus: 1996, Too: 103 |
| race | 0 | 1 | FALSE | 5 | Whi: 2687, Asi: 1190, Lat: 406, Oth: 372 |
| goal | 0 | 1 | FALSE | 6 | Fun: 2054, Mee: 1793, Dat: 372, To : 273 |
| partner_career_macro | 0 | 1 | FALSE | 17 | Aca: 1396, Ban: 1292, Cre: 408, Law: 379 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| match | 0 | 1 | 0.18 | 0.38 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| same_race | 0 | 1 | 0.39 | 0.49 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▅ |
| partner_liked_me | 0 | 1 | 0.44 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▆ |
| i_liked_partner | 0 | 1 | 0.45 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▆ |
| met_before | 0 | 1 | 0.05 | 0.22 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| unique_id_number | 0 | 1 | 283.07 | 156.88 | 4.00 | 160.00 | 274.00 | 411.00 | 552.00 | ▇▆▇▆▇ |
| id_within_wave | 0 | 1 | 9.10 | 5.57 | 1.00 | 4.00 | 8.00 | 13.00 | 22.00 | ▇▆▅▃▂ |
| subject_within_gender | 0 | 1 | 17.61 | 11.09 | 1.00 | 8.00 | 16.00 | 26.00 | 44.00 | ▇▇▅▅▂ |
| n_people_met_in_wave | 0 | 1 | 17.01 | 4.33 | 5.00 | 14.00 | 18.00 | 20.00 | 22.00 | ▁▂▂▅▇ |
| position_meeting | 0 | 1 | 9.13 | 5.50 | 1.00 | 4.00 | 9.00 | 13.00 | 22.00 | ▇▆▅▅▂ |
| order_meeting | 0 | 1 | 8.87 | 5.46 | 1.00 | 4.00 | 8.00 | 13.00 | 22.00 | ▇▆▅▃▂ |
| partnet_id_within_wave | 0 | 1 | 9.12 | 5.51 | 1.00 | 5.00 | 9.00 | 13.00 | 22.00 | ▇▆▅▃▂ |
| partner_unique_id_number | 0 | 1 | 282.91 | 156.98 | 4.00 | 158.00 | 274.00 | 411.00 | 552.00 | ▇▆▇▆▇ |
| interests_correlation | 0 | 1 | 0.20 | 0.30 | -0.83 | -0.02 | 0.22 | 0.43 | 0.91 | ▁▃▇▇▂ |
| my_age | 0 | 1 | 26.16 | 3.44 | 18.00 | 24.00 | 26.00 | 28.00 | 55.00 | ▇▇▁▁▁ |
| partner_age | 0 | 1 | 26.19 | 3.41 | 18.00 | 24.00 | 26.00 | 28.00 | 55.00 | ▇▇▁▁▁ |
| partner_stated_pref_time0_attractive | 0 | 1 | 0.22 | 0.11 | 0.00 | 0.15 | 0.20 | 0.25 | 1.00 | ▇▃▁▁▁ |
| partner_stated_pref_time0_sincere | 0 | 1 | 0.18 | 0.07 | 0.00 | 0.15 | 0.18 | 0.20 | 0.47 | ▂▇▇▁▁ |
| partner_stated_pref_time0_intelligent | 0 | 1 | 0.20 | 0.07 | 0.00 | 0.18 | 0.20 | 0.24 | 0.50 | ▁▇▃▁▁ |
| partner_stated_pref_time0_fun | 0 | 1 | 0.17 | 0.06 | 0.00 | 0.15 | 0.18 | 0.20 | 0.50 | ▂▇▂▁▁ |
| partner_stated_pref_time0_ambitious | 0 | 1 | 0.11 | 0.06 | 0.00 | 0.05 | 0.10 | 0.15 | 0.36 | ▆▇▇▁▁ |
| partner_stated_pref_time0_shared_interests | 0 | 1 | 0.12 | 0.06 | 0.00 | 0.10 | 0.11 | 0.16 | 0.30 | ▅▇▇▃▁ |
| importance_same_race | 0 | 1 | 3.83 | 2.83 | 1.00 | 1.00 | 3.00 | 6.00 | 10.00 | ▇▃▃▂▂ |
| importance_same_religion | 0 | 1 | 3.61 | 2.85 | 1.00 | 1.00 | 3.00 | 6.00 | 10.00 | ▇▂▃▂▂ |
| happy_expec | 0 | 1 | 5.49 | 1.78 | 1.00 | 5.00 | 6.00 | 7.00 | 10.00 | ▁▃▇▅▁ |
| i_found_partner__attractive | 0 | 1 | 6.25 | 1.94 | 0.00 | 5.00 | 6.00 | 8.00 | 10.00 | ▁▃▇▇▂ |
| i_found_partner__sincere | 0 | 1 | 7.22 | 1.72 | 0.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▁▃▇▃ |
| i_found_partner__intelligent | 0 | 1 | 7.43 | 1.52 | 0.00 | 7.00 | 8.00 | 8.00 | 10.00 | ▁▁▃▇▃ |
| i_found_partner__fun | 0 | 1 | 6.48 | 1.94 | 0.00 | 5.00 | 7.00 | 8.00 | 10.00 | ▁▂▇▇▃ |
| i_found_partner__ambitious | 0 | 1 | 6.82 | 1.79 | 0.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▁▆▇▃ |
| i_found_partner__interests | 0 | 1 | 5.55 | 2.14 | 0.00 | 4.00 | 6.00 | 7.00 | 10.00 | ▂▅▇▆▂ |
| partner_found_me__attractive | 0 | 1 | 6.21 | 1.93 | 0.00 | 5.00 | 6.00 | 8.00 | 10.00 | ▁▃▇▇▂ |
| partner_found_me__sincere | 0 | 1 | 7.17 | 1.74 | 0.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▁▃▇▃ |
| partner_found_me__intelligent | 0 | 1 | 7.39 | 1.53 | 0.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▁▃▇▃ |
| partner_found_me__fun | 0 | 1 | 6.43 | 1.94 | 0.00 | 5.00 | 7.00 | 8.00 | 11.00 | ▁▂▇▇▂ |
| partner_found_me__ambitious | 0 | 1 | 6.76 | 1.79 | 0.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▂▆▇▃ |
| partner_found_me__interests | 0 | 1 | 5.50 | 2.13 | 0.00 | 4.00 | 6.00 | 7.00 | 10.00 | ▂▅▇▆▂ |
| probability_i_find_partner_liked_me | 0 | 1 | 5.33 | 2.12 | 0.00 | 4.00 | 5.00 | 7.00 | 10.00 | ▂▅▇▅▁ |
| you_look_for__attractive | 0 | 1 | 0.22 | 0.11 | 0.00 | 0.15 | 0.20 | 0.25 | 0.90 | ▇▇▁▁▁ |
| you_look_for__sincere | 0 | 1 | 0.18 | 0.07 | 0.00 | 0.15 | 0.18 | 0.20 | 0.47 | ▁▇▇▁▁ |
| you_look_for__intelligent | 0 | 1 | 0.20 | 0.07 | 0.00 | 0.18 | 0.20 | 0.23 | 0.50 | ▁▇▃▁▁ |
| you_look_for__fun | 0 | 1 | 0.17 | 0.06 | 0.00 | 0.15 | 0.18 | 0.20 | 0.50 | ▂▇▂▁▁ |
| you_look_for__ambitious | 0 | 1 | 0.11 | 0.06 | 0.00 | 0.05 | 0.10 | 0.15 | 0.36 | ▆▇▇▁▁ |
| you_look_for__shared_interests | 0 | 1 | 0.12 | 0.06 | 0.00 | 0.10 | 0.11 | 0.16 | 0.30 | ▆▇▇▃▁ |
| you_perceive_yourself__attractive | 0 | 1 | 7.14 | 1.41 | 2.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▂▂▇▂ |
| you_perceive_yourself__sincere | 0 | 1 | 8.33 | 1.44 | 2.00 | 8.00 | 9.00 | 9.00 | 10.00 | ▁▁▁▆▇ |
| you_perceive_yourself__fun | 0 | 1 | 7.77 | 1.58 | 2.00 | 7.00 | 8.00 | 9.00 | 10.00 | ▁▁▂▇▆ |
| you_perceive_yourself__intelligent | 0 | 1 | 8.50 | 1.08 | 3.00 | 8.00 | 9.00 | 9.00 | 10.00 | ▁▁▂▅▇ |
| you_perceive_yourself__ambitious | 0 | 1 | 7.65 | 1.83 | 2.00 | 7.00 | 8.00 | 9.00 | 10.00 | ▁▂▂▇▇ |
| probability_partner_find_i_liked_partner | 0 | 1 | 5.27 | 2.12 | 0.00 | 4.00 | 5.00 | 7.00 | 10.00 | ▂▅▇▅▁ |
| partner_perceives_himself__attractive | 0 | 1 | 7.11 | 1.40 | 2.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▂▂▇▂ |
| partner_perceives_himself__sincere | 0 | 1 | 8.33 | 1.44 | 2.00 | 8.00 | 9.00 | 9.00 | 10.00 | ▁▁▁▆▇ |
| partner_perceives_himself__fun | 0 | 1 | 7.73 | 1.57 | 2.00 | 7.00 | 8.00 | 9.00 | 10.00 | ▁▁▂▇▆ |
| partner_perceives_himself__intelligent | 0 | 1 | 8.46 | 1.08 | 3.00 | 8.00 | 8.00 | 9.00 | 10.00 | ▁▁▂▆▇ |
| partner_perceives_himself__ambitious | 0 | 1 | 7.63 | 1.82 | 2.00 | 7.00 | 8.00 | 9.00 | 10.00 | ▁▂▂▇▇ |
We can check, for example, whether people have an accurate image of their own attractiveness:
escala_sexo = c(Homem = "darkblue", Mulher = "darkred")
dados_finais %>%
ggplot(
aes(
y = partner_found_me__attractive,
x = you_perceive_yourself__attractive
)
) +
geom_boxplot(
aes(
group = you_perceive_yourself__attractive,
color = sex,
fill = sex,
alpha = 0.3
),
show.legend = FALSE
) +
scale_color_manual(
values = escala_sexo
) +
scale_fill_manual(
values = escala_sexo
) +
stat_smooth(
method = "loess",
formula = y ~ x,
show.legend = FALSE,
se = FALSE,
aes(
color = sex
)
) +
geom_function(
fun = identity
) +
facet_wrap(
~sex
) +
scale_x_continuous(
breaks = 0:10
) +
scale_y_continuous(
breaks = 0:10
) +
labs(
x = "Me acho bonito",
y = "Parceiro me acha bonito"
) +
theme_minimal()
How does my rating of the partner on a given attribute correlate with whether I liked the partner?
dados_grafico_partner_liked <- dados_finais %>%
select(
i_liked_partner,
starts_with("i_found_partner__"),
sex
) %>%
pivot_longer(
cols = -c(i_liked_partner, sex),
names_to = "i_found_partner",
names_pattern = "i_found_partner__(.*)",
values_to = "degree"
) %>%
mutate(
degree = round(degree)
) %>%
group_by(
degree,
i_found_partner,
sex
) %>%
summarise(
i_liked_partner = mean(i_liked_partner),
n = n()
) %>%
filter(
n > 100
)
ggplot(dados_grafico_partner_liked) +
geom_line(
aes(
x = degree,
y = i_liked_partner,
color = sex,
),
size = 1.2
) +
geom_point(
aes(
x = degree,
y = i_liked_partner,
color = sex,
size = n
)
) +
facet_wrap(
~i_found_partner
) +
theme_minimal() +
theme(
legend.position = "top"
) +
scale_x_continuous(
breaks = 1:10
) +
scale_y_continuous(
limits = c(0,1),
breaks = seq(0, to = 1, by = .2),
labels = percent_format(accuracy = 1)
) +
scale_color_manual(
values = escala_sexo
) +
labs(
x = "Gostei deste atributo no parceiro",
y = "Gostei do parceiro. Quero ele(a)"
)
In the previous analysis we computed the conditional mean variable by variable, but we can compute the mean conditional on all the variables at the same time.
The way to do that is to run a multiple ordinary least squares linear regression.
parsnip is the successor to the core of caret.
It offers a generic interface to several types of statistical learning models.
Here we choose a linear model and use R's lm function as the engine.
## Linear Regression Model Specification (regression)
##
## Computational engine: lm
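The specification that prints the output above can be built as follows. This is a sketch: the chunk defining `lm_mod` does not appear in this excerpt, but the name matches its use in the fit below.

```r
library(parsnip)

# Linear regression model type, fit with base R's lm() engine
lm_mod <- linear_reg() %>%
  set_engine("lm")

lm_mod
```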
Now we actually fit the model.
Note that the model is fit with interactions between the attributes and the "sex" dummy.
The tidy() function, from the broom package, extracts estimates and statistics from the objects returned by parsnip training functions such as fit(); yardstick, in turn, provides performance metrics.
lm_fit <-
lm_mod %>%
fit( i_liked_partner ~
sex +
i_found_partner__attractive * sex +
i_found_partner__ambitious * sex +
i_found_partner__fun * sex +
i_found_partner__intelligent * sex +
i_found_partner__interests * sex +
i_found_partner__sincere * sex ,
data = dados_finais)
tidy(lm_fit)
## # A tibble: 14 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.395 0.0442 -8.92 6.56e-19
## 2 sexHomem 0.0674 0.0643 1.05 2.95e- 1
## 3 i_found_partner__attractive 0.0673 0.00547 12.3 2.77e-34
## 4 i_found_partner__ambitious -0.0205 0.00633 -3.25 1.18e- 3
## 5 i_found_partner__fun 0.0365 0.00629 5.81 6.71e- 9
## 6 i_found_partner__intelligent 0.0189 0.00846 2.23 2.59e- 2
## 7 i_found_partner__interests 0.0479 0.00514 9.31 1.98e-20
## 8 i_found_partner__sincere -0.0164 0.00648 -2.53 1.13e- 2
## 9 sexHomem:i_found_partner__attractive 0.0472 0.00786 6.00 2.07e- 9
## 10 sexHomem:i_found_partner__ambitious -0.00375 0.00918 -0.409 6.83e- 1
## 11 sexHomem:i_found_partner__fun -0.000680 0.00921 -0.0738 9.41e- 1
## 12 sexHomem:i_found_partner__intelligent -0.0312 0.0122 -2.56 1.05e- 2
## 13 sexHomem:i_found_partner__interests -0.00185 0.00732 -0.252 8.01e- 1
## 14 sexHomem:i_found_partner__sincere -0.00353 0.00973 -0.363 7.17e- 1
This is easier to see as a chart:
dwplot(tidy(lm_fit), dot_args = list(size = 2, color = "darkblue"),
whisker_args = list(color = "darkblue"),
vline = geom_vline(xintercept = 0, colour = "darkblue", linetype = 2)) +
theme_minimal()
We can use the predict() function to generate estimates of y given new values of x.
medias_i_found <- dados_finais %>%
select(
starts_with("i_found_partner__"),
sex
) %>%
pivot_longer(
cols = -c(sex),
names_to = "i_found_partner",
names_pattern = "i_found_partner__(.*)",
values_to = "degree"
) %>%
mutate(
degree = as.numeric(degree)
) %>%
group_by(
sex,
i_found_partner
) %>%
summarise(
p10 = quantile(degree, probs = 0.1, na.rm = TRUE),
p90 = quantile(degree, probs = 0.9, na.rm = TRUE),
p25 = quantile(degree, probs = 0.25, na.rm = TRUE),
p75 = quantile(degree, probs = 0.75, na.rm = TRUE),
p33 = quantile(degree, probs = 0.33, na.rm = TRUE),
p67 = quantile(degree, probs = 0.67, na.rm = TRUE),
mean = mean(degree, na.rm = TRUE)
) %>%
pivot_wider(
names_from = i_found_partner,
values_from = c(mean, p10, p90, p25, p75, p33, p67)
)
med_h <- medias_i_found %>%
filter(
sex == "Homem"
)
med_m <- medias_i_found %>%
filter(
sex == "Mulher"
)
pontos_novos <-
tribble(
~attractive, ~ambitious, ~fun, ~intelligent, ~interests, ~sincere, ~sex, ~nome,
med_h$mean_attractive, med_h$mean_ambitious, med_h$mean_fun, med_h$mean_intelligent, med_h$mean_interests, med_h$mean_sincere, "Homem", "Média",
med_h$p10_attractive, med_h$mean_ambitious, med_h$mean_fun, med_h$mean_intelligent, med_h$mean_interests, med_h$mean_sincere, "Homem", "P10",
med_h$p25_attractive, med_h$mean_ambitious, med_h$mean_fun, med_h$mean_intelligent, med_h$mean_interests, med_h$mean_sincere, "Homem", "P25",
med_h$p90_attractive, med_h$mean_ambitious, med_h$mean_fun, med_h$mean_intelligent, med_h$mean_interests, med_h$mean_sincere, "Homem", "P90",
med_h$p75_attractive, med_h$mean_ambitious, med_h$mean_fun, med_h$mean_intelligent, med_h$mean_interests, med_h$mean_sincere, "Homem", "P75",
med_h$p33_attractive, med_h$mean_ambitious, med_h$mean_fun, med_h$mean_intelligent, med_h$mean_interests, med_h$mean_sincere, "Homem", "P33",
med_h$p67_attractive, med_h$mean_ambitious, med_h$mean_fun, med_h$mean_intelligent, med_h$mean_interests, med_h$mean_sincere, "Homem", "P67",
med_m$mean_attractive, med_m$mean_ambitious, med_m$mean_fun, med_m$mean_intelligent, med_m$mean_interests, med_m$mean_sincere, "Mulher", "Média",
med_m$p10_attractive, med_m$mean_ambitious, med_m$mean_fun, med_m$mean_intelligent, med_m$mean_interests, med_m$mean_sincere, "Mulher", "P10",
med_m$p90_attractive, med_m$mean_ambitious, med_m$mean_fun, med_m$mean_intelligent, med_m$mean_interests, med_m$mean_sincere, "Mulher", "P90",
med_m$p25_attractive, med_m$mean_ambitious, med_m$mean_fun, med_m$mean_intelligent, med_m$mean_interests, med_m$mean_sincere, "Mulher", "P25",
med_m$p75_attractive, med_m$mean_ambitious, med_m$mean_fun, med_m$mean_intelligent, med_m$mean_interests, med_m$mean_sincere, "Mulher", "P75",
med_m$p33_attractive, med_m$mean_ambitious, med_m$mean_fun, med_m$mean_intelligent, med_m$mean_interests, med_m$mean_sincere, "Mulher", "P33",
med_m$p67_attractive, med_m$mean_ambitious, med_m$mean_fun, med_m$mean_intelligent, med_m$mean_interests, med_m$mean_sincere, "Mulher", "P67"
) %>%
rename_with(
.cols = -c(sex, nome),
.fn = ~str_glue("i_found_partner__{.x}")
)
conf_int_pred <- predict(lm_fit,
new_data = pontos_novos,
type = "conf_int")
mean_pred <- predict(lm_fit,
new_data = pontos_novos
)
dados_pred <- pontos_novos %>%
bind_cols(
conf_int_pred
) %>%
bind_cols(
mean_pred
) %>%
view()
ggplot(dados_pred, aes(x = i_found_partner__attractive)) +
geom_point(aes(y = .pred, color = sex)) +
geom_errorbar(aes(ymin = .pred_lower,
ymax = .pred_upper, color = sex),
width = .2) +
labs(y = "Prob. I like partner")+
# geom_mark_circle(
# aes(
# y = .pred,
# label = nome,
# group = interaction(sex, nome),
# color = sex,
# fill = sex
# ),
# label.fontsize = 7,
# con.cap = 1,
# expand = 0.001,
# label.buffer = unit(1, 'mm'),
# show.legend = FALSE
# ) +
theme_minimal() +
theme(
legend.position = "top"
) +
geom_line(
aes(
color = sex,
y = .pred
)
) +
scale_color_manual(
values = escala_sexo
) +
scale_x_continuous(
breaks = 1:10
) +
scale_y_continuous(
breaks = seq(0, to = 1, by= 0.2),
limits = c(0,1),
label = percent_format(accuracy = 1)
)
Now let's move beyond the linear model and fit a neural network.
dados_finais_nao_nulos_sex_numerico <- dados_finais %>%
mutate(
sex = if_else(sex == "Homem", 1, 0) ,
i_liked_partner = as.numeric(i_liked_partner),
) %>%
filter(
across(
.cols = everything(),
.fns = ~!is.na(.x)
)
)
pontos_novos_rand_for <- pontos_novos %>%
mutate(
sex = if_else(sex == "Homem", 1, 0)
)
set.seed(192)
modelo_nnet <- mlp(mode = "regression", hidden_units = 10 ) %>%
set_engine("nnet")
modelo_nnet
## Single Layer Neural Network Specification (regression)
##
## Main Arguments:
## hidden_units = 10
##
## Computational engine: nnet
fit_nnet <- modelo_nnet %>% fit( i_liked_partner ~
i_found_partner__attractive +
i_found_partner__ambitious +
i_found_partner__fun +
i_found_partner__intelligent +
i_found_partner__interests +
i_found_partner__sincere +
sex,
data = dados_finais_nao_nulos_sex_numerico)
fit_nnet
## parsnip model object
##
## Fit time: 940ms
## a 7-10-1 network with 91 weights
## inputs: i_found_partner__attractive i_found_partner__ambitious i_found_partner__fun i_found_partner__intelligent i_found_partner__interests i_found_partner__sincere sex
## output(s): i_liked_partner
## options were - linear output units
With the neural network, the relationships do not need to be linear. That is the case here:
mean_pred <- predict(fit_nnet,
new_data = pontos_novos_rand_for
)
dados_pred_nnet <- pontos_novos %>%
bind_cols(
mean_pred
)
ggplot(dados_pred_nnet, aes(x = i_found_partner__attractive)) +
geom_point(aes(y = .pred, color = sex)) +
geom_mark_circle(
aes(
y = .pred,
label = nome,
group = interaction(nome, sex),
color = sex,
fill = sex
),
label.fontsize = 8,
con.cap = 1,
expand = 0.001,
label.buffer = unit(3.5, 'mm'),
show.legend = FALSE
) +
theme_minimal() +
theme(
legend.position = "top"
) +
geom_line(
aes(
color = sex,
y = .pred
)
) +
scale_color_manual(
values = escala_sexo
) +
scale_y_continuous(
breaks = seq(0, to = 1, by= 0.2),
limits = c(0,1)
) +
labs(y = "Prob. I like partner")
It is important to have an exploration session, which can be far more detailed than the one we did here.
Exploration helps us run sanity checks on the data and extract insights that may or may not feed the feature engineering process, which can help the model reach better results.
Most of the time, feature engineering is where we can most improve whatever type of model we end up using.
The dependence on this process is smaller with very complex deep learning models, but those require a colossal amount of data.
Everything we do during model selection, as we have seen, must be done on the training data (which will also serve for validation).
Only after choosing ONE model do we evaluate it on the test data.
Source: Feature Engineering and Selection: A Practical Approach for Predictive Models (Kuhn and Johnson)
The rsample library provides the infrastructure needed to draw samples from the available data.
We use it here to set aside the test data.
It will be used again to create the resamples for cross-validation.
set.seed() is used for reproducibility: with the same seed, every run of the script generates the same (pseudo)random sequence.
The strata parameter guarantees that the balance of one of the attributes (here, the one we will use as the outcome) is preserved in both partitions.
dados_classificacao <- dados_finais %>%
mutate(
i_liked_partner = if_else(i_liked_partner == 1, "Liked", "Not") %>% factor(levels = c("Liked","Not"))
)
set.seed(123)
# Put 3/4 of the data into the training set
split_dado <- initial_split(
data = dados_classificacao,
strata = i_liked_partner,
prop = 3/4
)
# Create data frames for the two sets:
dado_treino <- training(split_dado)
dado_teste <- testing(split_dado)
| Name | dado_treino |
| Number of rows | 3664 |
| Number of columns | 64 |
| _______________________ | |
| Column type frequency: | |
| factor | 10 |
| numeric | 54 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| sex | 0 | 1 | FALSE | 2 | Hom: 1845, Mul: 1819 |
| choice | 0 | 1 | FALSE | 2 | ext: 3076, lim: 588 |
| partner_race | 0 | 1 | FALSE | 5 | Whi: 2015, Asi: 919, Lat: 270, Oth: 267 |
| frequency_date | 0 | 1 | TRUE | 7 | Twi: 968, Sev: 920, Onc: 706, Nev: 602 |
| career_macro | 0 | 1 | FALSE | 17 | Aca: 1088, Ban: 932, Cre: 320, Law: 285 |
| opinion_duration_of_date | 0 | 1 | FALSE | 3 | Too: 2116, Jus: 1473, Too: 75 |
| race | 0 | 1 | FALSE | 5 | Whi: 2049, Asi: 895, Lat: 302, Oth: 258 |
| goal | 0 | 1 | FALSE | 6 | Fun: 1528, Mee: 1351, Dat: 281, To : 215 |
| i_liked_partner | 0 | 1 | FALSE | 2 | Not: 2017, Lik: 1647 |
| partner_career_macro | 0 | 1 | FALSE | 17 | Aca: 1055, Ban: 970, Cre: 303, Law: 283 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| match | 0 | 1 | 0.18 | 0.39 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| same_race | 0 | 1 | 0.40 | 0.49 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▅ |
| partner_liked_me | 0 | 1 | 0.43 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▆ |
| met_before | 0 | 1 | 0.05 | 0.22 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| unique_id_number | 0 | 1 | 283.23 | 157.01 | 4.00 | 158.00 | 274.00 | 411.00 | 552.00 | ▇▆▇▆▇ |
| id_within_wave | 0 | 1 | 9.14 | 5.60 | 1.00 | 4.00 | 8.00 | 14.00 | 22.00 | ▇▆▅▃▂ |
| subject_within_gender | 0 | 1 | 17.69 | 11.17 | 1.00 | 8.00 | 16.00 | 27.00 | 44.00 | ▇▇▅▅▂ |
| n_people_met_in_wave | 0 | 1 | 17.02 | 4.35 | 5.00 | 14.00 | 18.00 | 20.00 | 22.00 | ▁▂▂▃▇ |
| position_meeting | 0 | 1 | 9.09 | 5.48 | 1.00 | 4.00 | 8.00 | 13.00 | 22.00 | ▇▆▅▅▂ |
| order_meeting | 0 | 1 | 8.85 | 5.45 | 1.00 | 4.00 | 8.00 | 13.00 | 22.00 | ▇▆▅▃▂ |
| partnet_id_within_wave | 0 | 1 | 9.23 | 5.49 | 1.00 | 5.00 | 9.00 | 14.00 | 22.00 | ▇▆▅▅▂ |
| partner_unique_id_number | 0 | 1 | 283.12 | 156.95 | 4.00 | 158.00 | 274.00 | 411.00 | 552.00 | ▇▆▇▆▇ |
| interests_correlation | 0 | 1 | 0.19 | 0.30 | -0.83 | -0.02 | 0.21 | 0.43 | 0.91 | ▁▃▇▇▂ |
| my_age | 0 | 1 | 26.14 | 3.45 | 18.00 | 24.00 | 26.00 | 28.00 | 55.00 | ▇▇▁▁▁ |
| partner_age | 0 | 1 | 26.19 | 3.35 | 18.00 | 24.00 | 26.00 | 28.00 | 55.00 | ▇▇▁▁▁ |
| partner_stated_pref_time0_attractive | 0 | 1 | 0.22 | 0.11 | 0.00 | 0.15 | 0.20 | 0.25 | 1.00 | ▇▃▁▁▁ |
| partner_stated_pref_time0_sincere | 0 | 1 | 0.18 | 0.07 | 0.00 | 0.15 | 0.18 | 0.20 | 0.47 | ▁▇▇▁▁ |
| partner_stated_pref_time0_intelligent | 0 | 1 | 0.21 | 0.07 | 0.00 | 0.18 | 0.20 | 0.25 | 0.50 | ▁▇▃▁▁ |
| partner_stated_pref_time0_fun | 0 | 1 | 0.17 | 0.06 | 0.00 | 0.15 | 0.18 | 0.20 | 0.50 | ▂▇▂▁▁ |
| partner_stated_pref_time0_ambitious | 0 | 1 | 0.11 | 0.06 | 0.00 | 0.05 | 0.10 | 0.15 | 0.36 | ▆▇▇▁▁ |
| partner_stated_pref_time0_shared_interests | 0 | 1 | 0.12 | 0.06 | 0.00 | 0.10 | 0.11 | 0.16 | 0.30 | ▅▇▇▃▁ |
| importance_same_race | 0 | 1 | 3.87 | 2.85 | 1.00 | 1.00 | 3.00 | 6.00 | 10.00 | ▇▃▃▂▂ |
| importance_same_religion | 0 | 1 | 3.60 | 2.83 | 1.00 | 1.00 | 3.00 | 6.00 | 10.00 | ▇▂▃▂▁ |
| happy_expec | 0 | 1 | 5.52 | 1.77 | 1.00 | 5.00 | 6.00 | 7.00 | 10.00 | ▁▃▇▅▁ |
| i_found_partner__attractive | 0 | 1 | 6.23 | 1.93 | 0.00 | 5.00 | 6.00 | 8.00 | 10.00 | ▁▃▇▇▂ |
| i_found_partner__sincere | 0 | 1 | 7.19 | 1.73 | 0.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▁▃▇▃ |
| i_found_partner__intelligent | 0 | 1 | 7.42 | 1.51 | 0.00 | 7.00 | 7.00 | 8.00 | 10.00 | ▁▁▃▇▃ |
| i_found_partner__fun | 0 | 1 | 6.46 | 1.94 | 0.00 | 5.00 | 7.00 | 8.00 | 10.00 | ▁▂▇▇▃ |
| i_found_partner__ambitious | 0 | 1 | 6.82 | 1.79 | 0.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▁▆▇▃ |
| i_found_partner__interests | 0 | 1 | 5.54 | 2.14 | 0.00 | 4.00 | 6.00 | 7.00 | 10.00 | ▂▅▇▆▂ |
| partner_found_me__attractive | 0 | 1 | 6.20 | 1.95 | 0.00 | 5.00 | 6.00 | 8.00 | 10.00 | ▁▃▇▇▂ |
| partner_found_me__sincere | 0 | 1 | 7.18 | 1.73 | 0.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▁▃▇▃ |
| partner_found_me__intelligent | 0 | 1 | 7.39 | 1.54 | 0.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▁▃▇▃ |
| partner_found_me__fun | 0 | 1 | 6.42 | 1.93 | 0.00 | 5.00 | 7.00 | 8.00 | 11.00 | ▁▂▇▇▂ |
| partner_found_me__ambitious | 0 | 1 | 6.75 | 1.80 | 0.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▂▆▇▃ |
| partner_found_me__interests | 0 | 1 | 5.51 | 2.14 | 0.00 | 4.00 | 6.00 | 7.00 | 10.00 | ▂▅▇▆▂ |
| probability_i_find_partner_liked_me | 0 | 1 | 5.33 | 2.12 | 0.00 | 4.00 | 5.00 | 7.00 | 10.00 | ▂▅▇▅▁ |
| you_look_for__attractive | 0 | 1 | 0.22 | 0.11 | 0.00 | 0.15 | 0.20 | 0.25 | 0.90 | ▇▇▁▁▁ |
| you_look_for__sincere | 0 | 1 | 0.18 | 0.07 | 0.00 | 0.15 | 0.18 | 0.20 | 0.47 | ▁▇▇▁▁ |
| you_look_for__intelligent | 0 | 1 | 0.20 | 0.07 | 0.00 | 0.18 | 0.20 | 0.24 | 0.50 | ▁▇▃▁▁ |
| you_look_for__fun | 0 | 1 | 0.17 | 0.06 | 0.00 | 0.15 | 0.18 | 0.20 | 0.50 | ▂▇▂▁▁ |
| you_look_for__ambitious | 0 | 1 | 0.11 | 0.06 | 0.00 | 0.05 | 0.10 | 0.15 | 0.36 | ▆▇▇▁▁ |
| you_look_for__shared_interests | 0 | 1 | 0.12 | 0.06 | 0.00 | 0.10 | 0.11 | 0.16 | 0.30 | ▅▇▇▃▁ |
| you_perceive_yourself__attractive | 0 | 1 | 7.13 | 1.42 | 2.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▂▂▇▂ |
| you_perceive_yourself__sincere | 0 | 1 | 8.33 | 1.43 | 2.00 | 8.00 | 9.00 | 9.00 | 10.00 | ▁▁▁▆▇ |
| you_perceive_yourself__fun | 0 | 1 | 7.75 | 1.58 | 2.00 | 7.00 | 8.00 | 9.00 | 10.00 | ▁▁▂▇▆ |
| you_perceive_yourself__intelligent | 0 | 1 | 8.50 | 1.07 | 3.00 | 8.00 | 9.00 | 9.00 | 10.00 | ▁▁▂▅▇ |
| you_perceive_yourself__ambitious | 0 | 1 | 7.64 | 1.84 | 2.00 | 7.00 | 8.00 | 9.00 | 10.00 | ▁▂▂▇▇ |
| probability_partner_find_i_liked_partner | 0 | 1 | 5.27 | 2.14 | 0.00 | 4.00 | 5.00 | 7.00 | 10.00 | ▂▅▇▅▁ |
| partner_perceives_himself__attractive | 0 | 1 | 7.10 | 1.40 | 2.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▂▂▇▂ |
| partner_perceives_himself__sincere | 0 | 1 | 8.35 | 1.42 | 2.00 | 8.00 | 9.00 | 9.00 | 10.00 | ▁▁▁▆▇ |
| partner_perceives_himself__fun | 0 | 1 | 7.72 | 1.57 | 2.00 | 7.00 | 8.00 | 9.00 | 10.00 | ▁▁▂▇▆ |
| partner_perceives_himself__intelligent | 0 | 1 | 8.46 | 1.09 | 3.00 | 8.00 | 8.00 | 9.00 | 10.00 | ▁▁▂▆▇ |
| partner_perceives_himself__ambitious | 0 | 1 | 7.65 | 1.80 | 2.00 | 7.00 | 8.00 | 9.00 | 10.00 | ▁▂▂▇▆ |
| Name | dado_teste |
| Number of rows | 1221 |
| Number of columns | 64 |
| _______________________ | |
| Column type frequency: | |
| factor | 10 |
| numeric | 54 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| sex | 0 | 1 | FALSE | 2 | Hom: 611, Mul: 610 |
| choice | 0 | 1 | FALSE | 2 | ext: 1023, lim: 198 |
| partner_race | 0 | 1 | FALSE | 5 | Whi: 668, Asi: 291, Lat: 116, Oth: 90 |
| frequency_date | 0 | 1 | TRUE | 7 | Twi: 333, Sev: 322, Onc: 204, Nev: 191 |
| career_macro | 0 | 1 | FALSE | 16 | Aca: 384, Ban: 304, Law: 116, Cre: 99 |
| opinion_duration_of_date | 0 | 1 | FALSE | 3 | Too: 670, Jus: 523, Too: 28 |
| race | 0 | 1 | FALSE | 5 | Whi: 638, Asi: 295, Oth: 114, Lat: 104 |
| goal | 0 | 1 | FALSE | 6 | Fun: 526, Mee: 442, Dat: 91, Oth: 58 |
| i_liked_partner | 0 | 1 | FALSE | 2 | Not: 672, Lik: 549 |
| partner_career_macro | 0 | 1 | FALSE | 17 | Aca: 341, Ban: 322, Cre: 105, Law: 96 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| match | 0 | 1 | 0.17 | 0.38 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| same_race | 0 | 1 | 0.37 | 0.48 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▅ |
| partner_liked_me | 0 | 1 | 0.44 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▆ |
| met_before | 0 | 1 | 0.05 | 0.22 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| unique_id_number | 0 | 1 | 282.60 | 156.57 | 4.00 | 160.00 | 274.00 | 408.00 | 552.00 | ▆▆▇▆▇ |
| id_within_wave | 0 | 1 | 8.98 | 5.46 | 1.00 | 4.00 | 8.00 | 13.00 | 22.00 | ▇▆▅▃▂ |
| subject_within_gender | 0 | 1 | 17.37 | 10.87 | 1.00 | 8.00 | 16.00 | 26.00 | 44.00 | ▇▇▆▅▂ |
| n_people_met_in_wave | 0 | 1 | 16.97 | 4.27 | 5.00 | 15.00 | 18.00 | 20.00 | 22.00 | ▁▃▂▅▇ |
| position_meeting | 0 | 1 | 9.23 | 5.53 | 1.00 | 4.00 | 9.00 | 14.00 | 22.00 | ▇▆▅▅▂ |
| order_meeting | 0 | 1 | 8.95 | 5.47 | 1.00 | 4.00 | 8.00 | 13.00 | 22.00 | ▇▆▅▅▂ |
| partnet_id_within_wave | 0 | 1 | 8.81 | 5.57 | 1.00 | 4.00 | 8.00 | 13.00 | 22.00 | ▇▆▅▃▂ |
| partner_unique_id_number | 0 | 1 | 282.29 | 157.14 | 4.00 | 156.00 | 274.00 | 411.00 | 552.00 | ▇▆▇▆▇ |
| interests_correlation | 0 | 1 | 0.21 | 0.31 | -0.83 | -0.01 | 0.23 | 0.44 | 0.91 | ▁▃▇▇▂ |
| my_age | 0 | 1 | 26.19 | 3.39 | 18.00 | 24.00 | 26.00 | 28.00 | 42.00 | ▂▇▅▁▁ |
| partner_age | 0 | 1 | 26.18 | 3.58 | 18.00 | 23.00 | 26.00 | 28.00 | 55.00 | ▇▇▁▁▁ |
| partner_stated_pref_time0_attractive | 0 | 1 | 0.23 | 0.12 | 0.00 | 0.15 | 0.20 | 0.25 | 0.90 | ▆▇▂▁▁ |
| partner_stated_pref_time0_sincere | 0 | 1 | 0.17 | 0.07 | 0.00 | 0.15 | 0.18 | 0.20 | 0.47 | ▂▇▇▁▁ |
| partner_stated_pref_time0_intelligent | 0 | 1 | 0.20 | 0.07 | 0.00 | 0.17 | 0.20 | 0.23 | 0.50 | ▂▇▃▁▁ |
| partner_stated_pref_time0_fun | 0 | 1 | 0.18 | 0.06 | 0.00 | 0.15 | 0.18 | 0.20 | 0.50 | ▂▇▂▁▁ |
| partner_stated_pref_time0_ambitious | 0 | 1 | 0.11 | 0.06 | 0.00 | 0.05 | 0.10 | 0.15 | 0.36 | ▇▇▇▁▁ |
| partner_stated_pref_time0_shared_interests | 0 | 1 | 0.12 | 0.06 | 0.00 | 0.08 | 0.11 | 0.16 | 0.30 | ▆▇▇▅▁ |
| importance_same_race | 0 | 1 | 3.68 | 2.79 | 1.00 | 1.00 | 3.00 | 6.00 | 10.00 | ▇▃▂▂▁ |
| importance_same_religion | 0 | 1 | 3.66 | 2.90 | 1.00 | 1.00 | 3.00 | 6.00 | 10.00 | ▇▃▂▂▂ |
| happy_expec | 0 | 1 | 5.43 | 1.82 | 1.00 | 4.00 | 6.00 | 7.00 | 10.00 | ▁▃▇▅▁ |
| i_found_partner__attractive | 0 | 1 | 6.29 | 1.97 | 0.00 | 5.00 | 6.00 | 8.00 | 10.00 | ▁▃▇▇▃ |
| i_found_partner__sincere | 0 | 1 | 7.29 | 1.71 | 0.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▁▃▇▃ |
| i_found_partner__intelligent | 0 | 1 | 7.46 | 1.55 | 0.00 | 7.00 | 8.00 | 8.00 | 10.00 | ▁▁▃▇▃ |
| i_found_partner__fun | 0 | 1 | 6.54 | 1.94 | 0.00 | 5.00 | 7.00 | 8.00 | 10.00 | ▁▂▆▇▃ |
| i_found_partner__ambitious | 0 | 1 | 6.83 | 1.79 | 0.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▁▆▇▃ |
| i_found_partner__interests | 0 | 1 | 5.57 | 2.15 | 0.00 | 4.00 | 6.00 | 7.00 | 10.00 | ▂▅▇▆▂ |
| partner_found_me__attractive | 0 | 1 | 6.23 | 1.89 | 1.00 | 5.00 | 6.00 | 8.00 | 10.00 | ▁▃▇▇▂ |
| partner_found_me__sincere | 0 | 1 | 7.13 | 1.74 | 1.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▁▃▇▃ |
| partner_found_me__intelligent | 0 | 1 | 7.36 | 1.50 | 1.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▁▃▇▃ |
| partner_found_me__fun | 0 | 1 | 6.45 | 1.97 | 0.00 | 5.00 | 7.00 | 8.00 | 10.00 | ▁▂▇▇▃ |
| partner_found_me__ambitious | 0 | 1 | 6.80 | 1.76 | 1.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▂▆▇▃ |
| partner_found_me__interests | 0 | 1 | 5.48 | 2.11 | 0.00 | 4.00 | 6.00 | 7.00 | 10.00 | ▂▅▇▅▂ |
| probability_i_find_partner_liked_me | 0 | 1 | 5.36 | 2.12 | 0.00 | 4.00 | 5.00 | 7.00 | 10.00 | ▂▃▇▅▁ |
| you_look_for__attractive | 0 | 1 | 0.22 | 0.11 | 0.00 | 0.15 | 0.20 | 0.25 | 0.90 | ▇▇▁▁▁ |
| you_look_for__sincere | 0 | 1 | 0.18 | 0.07 | 0.00 | 0.15 | 0.19 | 0.20 | 0.47 | ▁▇▇▁▁ |
| you_look_for__intelligent | 0 | 1 | 0.20 | 0.06 | 0.00 | 0.18 | 0.20 | 0.23 | 0.50 | ▁▇▃▁▁ |
| you_look_for__fun | 0 | 1 | 0.17 | 0.06 | 0.00 | 0.15 | 0.18 | 0.20 | 0.50 | ▂▇▂▁▁ |
| you_look_for__ambitious | 0 | 1 | 0.11 | 0.06 | 0.00 | 0.05 | 0.10 | 0.15 | 0.36 | ▆▇▇▁▁ |
| you_look_for__shared_interests | 0 | 1 | 0.12 | 0.06 | 0.00 | 0.08 | 0.11 | 0.16 | 0.30 | ▆▇▆▅▁ |
| you_perceive_yourself__attractive | 0 | 1 | 7.18 | 1.38 | 2.00 | 7.00 | 7.00 | 8.00 | 10.00 | ▁▂▂▇▂ |
| you_perceive_yourself__sincere | 0 | 1 | 8.33 | 1.45 | 2.00 | 8.00 | 9.00 | 9.00 | 10.00 | ▁▁▁▆▇ |
| you_perceive_yourself__fun | 0 | 1 | 7.83 | 1.58 | 2.00 | 7.00 | 8.00 | 9.00 | 10.00 | ▁▁▂▇▇ |
| you_perceive_yourself__intelligent | 0 | 1 | 8.49 | 1.08 | 3.00 | 8.00 | 9.00 | 9.00 | 10.00 | ▁▁▂▆▇ |
| you_perceive_yourself__ambitious | 0 | 1 | 7.68 | 1.80 | 2.00 | 7.00 | 8.00 | 9.00 | 10.00 | ▁▂▂▇▇ |
| probability_partner_find_i_liked_partner | 0 | 1 | 5.27 | 2.05 | 0.00 | 4.00 | 5.00 | 7.00 | 10.00 | ▂▃▇▅▁ |
| partner_perceives_himself__attractive | 0 | 1 | 7.15 | 1.39 | 2.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▁▂▇▂ |
| partner_perceives_himself__sincere | 0 | 1 | 8.27 | 1.49 | 2.00 | 7.00 | 9.00 | 9.00 | 10.00 | ▁▁▁▆▇ |
| partner_perceives_himself__fun | 0 | 1 | 7.75 | 1.57 | 2.00 | 7.00 | 8.00 | 9.00 | 10.00 | ▁▁▂▇▆ |
| partner_perceives_himself__intelligent | 0 | 1 | 8.46 | 1.06 | 4.00 | 8.00 | 9.00 | 9.00 | 10.00 | ▁▁▂▅▇ |
| partner_perceives_himself__ambitious | 0 | 1 | 7.59 | 1.89 | 2.00 | 7.00 | 8.00 | 9.00 | 10.00 | ▁▂▂▇▇ |
As we mentioned, the tidymodels libraries are orthogonal.
The recipes library serves one specific purpose: defining the data pre-processing steps. These steps can be defined independently of the choice of model, the cross-validation scheme, the performance measurement, and so on.
One of the steps we can define with recipes is flagging attributes that play no predictive role and therefore should not be used in training or prediction, but that we want to keep in our tibble to identify the rows.
receita <- recipe(i_liked_partner ~ ., data = dado_treino) %>%
update_role(
match,
unique_id_number,
id_within_wave,
subject_within_gender,
partnet_id_within_wave,
partner_unique_id_number,
new_role = "ID"
)
summary(receita)
## # A tibble: 64 x 4
## variable type role source
## <chr> <chr> <chr> <chr>
## 1 sex nominal predictor original
## 2 choice nominal predictor original
## 3 partner_race nominal predictor original
## 4 frequency_date nominal predictor original
## 5 career_macro nominal predictor original
## 6 opinion_duration_of_date nominal predictor original
## 7 race nominal predictor original
## 8 goal nominal predictor original
## 9 match numeric ID original
## 10 same_race numeric predictor original
## # ... with 54 more rows
Some model engines do not handle factors well.
Anyone used to lm knows that factors are automatically turned into dummies, but that does not happen with every engine.
step_dummy() does that job: it creates one binary variable for each level of the factor (minus one). Note how we can use the all_nominal() and all_outcomes() selectors.
step_zv() removes variables with zero variance. This happens often when factor levels are infrequent.
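To make concrete what "one binary variable per level (minus one)" means, base R's model.matrix() performs an analogous expansion; the factor below is made up for illustration:

```r
# A made-up factor with three levels: "Fun", "Other", "Serious"
goal <- factor(c("Fun", "Serious", "Fun", "Other"))

# model.matrix() expands it into binary columns, one per level
# except the reference level ("Fun"), plus an intercept:
m <- model.matrix(~ goal)
colnames(m)  # "(Intercept)" "goalOther" "goalSerious"
```

step_dummy() performs the same expansion inside the recipe, without the intercept column.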
receita <- recipe(i_liked_partner ~ ., data = dado_treino) %>%
update_role(
match,
unique_id_number,
id_within_wave,
subject_within_gender,
partnet_id_within_wave,
partner_unique_id_number,
new_role = "ID"
) %>%
step_dummy(
all_nominal(), -all_outcomes()
) %>%
step_zv(all_predictors())
summary(receita) %>%
gt()
| variable | type | role | source |
|---|---|---|---|
| sex | nominal | predictor | original |
| choice | nominal | predictor | original |
| partner_race | nominal | predictor | original |
| frequency_date | nominal | predictor | original |
| career_macro | nominal | predictor | original |
| opinion_duration_of_date | nominal | predictor | original |
| race | nominal | predictor | original |
| goal | nominal | predictor | original |
| match | numeric | ID | original |
| same_race | numeric | predictor | original |
| partner_liked_me | numeric | predictor | original |
| met_before | numeric | predictor | original |
| unique_id_number | numeric | ID | original |
| id_within_wave | numeric | ID | original |
| subject_within_gender | numeric | ID | original |
| n_people_met_in_wave | numeric | predictor | original |
| position_meeting | numeric | predictor | original |
| order_meeting | numeric | predictor | original |
| partnet_id_within_wave | numeric | ID | original |
| partner_unique_id_number | numeric | ID | original |
| interests_correlation | numeric | predictor | original |
| my_age | numeric | predictor | original |
| partner_age | numeric | predictor | original |
| partner_stated_pref_time0_attractive | numeric | predictor | original |
| partner_stated_pref_time0_sincere | numeric | predictor | original |
| partner_stated_pref_time0_intelligent | numeric | predictor | original |
| partner_stated_pref_time0_fun | numeric | predictor | original |
| partner_stated_pref_time0_ambitious | numeric | predictor | original |
| partner_stated_pref_time0_shared_interests | numeric | predictor | original |
| importance_same_race | numeric | predictor | original |
| importance_same_religion | numeric | predictor | original |
| happy_expec | numeric | predictor | original |
| i_found_partner__attractive | numeric | predictor | original |
| i_found_partner__sincere | numeric | predictor | original |
| i_found_partner__intelligent | numeric | predictor | original |
| i_found_partner__fun | numeric | predictor | original |
| i_found_partner__ambitious | numeric | predictor | original |
| i_found_partner__interests | numeric | predictor | original |
| partner_found_me__attractive | numeric | predictor | original |
| partner_found_me__sincere | numeric | predictor | original |
| partner_found_me__intelligent | numeric | predictor | original |
| partner_found_me__fun | numeric | predictor | original |
| partner_found_me__ambitious | numeric | predictor | original |
| partner_found_me__interests | numeric | predictor | original |
| probability_i_find_partner_liked_me | numeric | predictor | original |
| you_look_for__attractive | numeric | predictor | original |
| you_look_for__sincere | numeric | predictor | original |
| you_look_for__intelligent | numeric | predictor | original |
| you_look_for__fun | numeric | predictor | original |
| you_look_for__ambitious | numeric | predictor | original |
| you_look_for__shared_interests | numeric | predictor | original |
| you_perceive_yourself__attractive | numeric | predictor | original |
| you_perceive_yourself__sincere | numeric | predictor | original |
| you_perceive_yourself__fun | numeric | predictor | original |
| you_perceive_yourself__intelligent | numeric | predictor | original |
| you_perceive_yourself__ambitious | numeric | predictor | original |
| probability_partner_find_i_liked_partner | numeric | predictor | original |
| partner_career_macro | nominal | predictor | original |
| partner_perceives_himself__attractive | numeric | predictor | original |
| partner_perceives_himself__sincere | numeric | predictor | original |
| partner_perceives_himself__fun | numeric | predictor | original |
| partner_perceives_himself__intelligent | numeric | predictor | original |
| partner_perceives_himself__ambitious | numeric | predictor | original |
| i_liked_partner | nominal | outcome | original |
Some factors are ordinal. It can be a good idea to encode each of them as a single numeric variable that preserves the natural order of the levels.
In our example, the variable representing how often people go out has levels that can be ordered.
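In base R terms, an ordered factor already carries integer codes that respect the order of its levels; the levels below are an illustrative subset of the ones in frequency_date:

```r
# An ordered factor: the level order is meaningful.
freq <- factor(
  c("Never", "Once a month", "Twice a week"),
  levels = c("Never", "Once a month", "Twice a week"),
  ordered = TRUE
)

# A single numeric score that preserves the order, similar in spirit
# to what step_ordinalscore() produces:
as.integer(freq)  # 1 2 3
```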
receita <- recipe(i_liked_partner ~ ., data = dado_treino) %>%
update_role(
match,
unique_id_number,
id_within_wave,
subject_within_gender,
partnet_id_within_wave,
partner_unique_id_number,
new_role = "ID"
) %>%
step_ordinalscore(
frequency_date
) %>%
step_dummy(
all_nominal(), -frequency_date, -all_outcomes()
) %>%
step_zv(all_predictors())
Now that we know how to create a pre-processing recipe, and recalling how to create a generic interface for a model with parsnip and how to select an engine for it, we can build a small workflow to carry out this process, using the workflows library.
lr_mod <-
logistic_reg() %>%
set_engine("glm")
wf <- workflow() %>%
add_recipe(receita) %>%
add_model(lr_mod)
wf
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: logistic_reg()
##
## -- Preprocessor ----------------------------------------------------------------
## 3 Recipe Steps
##
## * step_ordinalscore()
## * step_dummy()
## * step_zv()
##
## -- Model -----------------------------------------------------------------------
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
With the workflow created, we can fit it using the fit() function.
fit_teste <-
wf %>%
fit(
data = dado_treino
)
fit_teste %>% tidy() %>%
select(
term,
estimate,
p.value
) %>%
arrange(
p.value
) %>%
gt() %>%
fmt_number(
columns = one_of("estimate"),
decimals = 3
) %>%
fmt_number(
columns = one_of("p.value"),
decimals = 2
)
| term | estimate | p.value |
|---|---|---|
| i_found_partner__attractive | −0.573 | 0.00 |
| probability_i_find_partner_liked_me | −0.308 | 0.00 |
| i_found_partner__interests | −0.222 | 0.00 |
| i_found_partner__fun | −0.279 | 0.00 |
| you_look_for__attractive | 5.085 | 0.00 |
| you_perceive_yourself__fun | 0.171 | 0.00 |
| race_White | 0.620 | 0.00 |
| sex_Homem | −0.536 | 0.00 |
| partner_found_me__attractive | 0.120 | 0.00 |
| career_macro_Politics | 2.790 | 0.00 |
| you_perceive_yourself__sincere | 0.133 | 0.00 |
| same_race | −0.376 | 0.00 |
| importance_same_race | 0.063 | 0.00 |
| i_found_partner__ambitious | 0.123 | 0.00 |
| goal_Serious | −0.984 | 0.00 |
| career_macro_Pro.sports.Athletics | −4.469 | 0.00 |
| probability_partner_find_i_liked_partner | −0.083 | 0.00 |
| you_look_for__ambitious | 4.241 | 0.00 |
| partner_career_macro_Speech.Pathology | 2.574 | 0.00 |
| n_people_met_in_wave | −0.054 | 0.00 |
| i_found_partner__sincere | 0.110 | 0.01 |
| career_macro_Lawyer | 0.515 | 0.01 |
| partner_career_macro_Psychologist | 0.729 | 0.01 |
| partner_career_macro_Creative.Arts.Entertainment | 0.443 | 0.01 |
| career_macro_Other | 1.721 | 0.03 |
| goal_Other | 0.623 | 0.03 |
| frequency_date | 0.077 | 0.03 |
| importance_same_religion | 0.040 | 0.03 |
| partner_liked_me | 0.229 | 0.04 |
| partner_found_me__interests | 0.060 | 0.04 |
| partner_career_macro_Journalism | −1.347 | 0.04 |
| career_macro_Social.Work | −0.596 | 0.04 |
| my_age | 0.029 | 0.05 |
| partner_career_macro_Real.Estate | 1.255 | 0.05 |
| choice_limited | −0.429 | 0.05 |
| happy_expec | −0.057 | 0.06 |
| opinion_duration_of_date_Too.little | 0.188 | 0.06 |
| race_Latino | 0.352 | 0.07 |
| you_perceive_yourself__attractive | 0.078 | 0.08 |
| you_look_for__sincere | 1.770 | 0.08 |
| you_look_for__intelligent | 1.716 | 0.09 |
| partner_perceives_himself__intelligent | 0.089 | 0.09 |
| opinion_duration_of_date_Too.much | −0.538 | 0.09 |
| career_macro_Creative.Arts.Entertainment | 0.312 | 0.09 |
| partner_perceives_himself__fun | −0.059 | 0.11 |
| partner_career_macro_Lawyer | −0.311 | 0.12 |
| partner_race_Others | −0.314 | 0.12 |
| partner_career_macro_Pro.sports.Athletics | −2.273 | 0.12 |
| (Intercept) | 2.235 | 0.13 |
| partner_found_me__intelligent | −0.061 | 0.17 |
| goal_Meet.new.people | −0.244 | 0.18 |
| career_macro_Engineer | 0.390 | 0.19 |
| career_macro_Banking.Consulting.Finance.Marketing.Business.CEO.Entrepreneur.Admin | −0.168 | 0.20 |
| partner_stated_pref_time0_intelligent | −1.271 | 0.20 |
| partner_perceives_himself__ambitious | 0.040 | 0.22 |
| partner_career_macro_Doctor.Medicine | −0.248 | 0.22 |
| career_macro_Doctor.Medicine | 0.231 | 0.24 |
| partner_career_macro_Social.Work | 0.354 | 0.26 |
| partner_perceives_himself__sincere | −0.041 | 0.26 |
| position_meeting | −0.010 | 0.26 |
| you_look_for__fun | −1.132 | 0.27 |
| race_Black | −0.278 | 0.27 |
| partner_found_me__ambitious | −0.037 | 0.27 |
| career_macro_Journalism | −0.690 | 0.27 |
| you_perceive_yourself__ambitious | −0.033 | 0.30 |
| interests_correlation | 0.150 | 0.33 |
| partner_career_macro_Engineer | 0.265 | 0.38 |
| i_found_partner__intelligent | −0.043 | 0.38 |
| partner_perceives_himself__attractive | 0.036 | 0.40 |
| career_macro_Psychologist | −0.223 | 0.40 |
| career_macro_Speech.Pathology | 0.986 | 0.42 |
| partner_career_macro_Architecture | −0.810 | 0.42 |
| career_macro_Undecided | −0.187 | 0.43 |
| partner_found_me__sincere | −0.027 | 0.46 |
| partner_stated_pref_time0_sincere | 0.732 | 0.48 |
| career_macro_Architecture | −0.753 | 0.50 |
| partner_career_macro_Banking.Consulting.Finance.Marketing.Business.CEO.Entrepreneur.Admin | −0.078 | 0.55 |
| race_Others | −0.121 | 0.56 |
| partner_race_Black | −0.124 | 0.57 |
| career_macro_Real.Estate | −0.987 | 0.60 |
| partner_career_macro_Other | −0.298 | 0.60 |
| career_macro_International.Humanitarian.Affairs | 0.094 | 0.65 |
| partner_stated_pref_time0_ambitious | 0.479 | 0.71 |
| partner_career_macro_International.Humanitarian.Affairs | 0.063 | 0.74 |
| goal_To.say | −0.084 | 0.75 |
| goal_Fun | 0.054 | 0.76 |
| partner_stated_pref_time0_fun | 0.274 | 0.79 |
| partner_stated_pref_time0_attractive | −0.134 | 0.87 |
| partner_career_macro_Undecided | 0.035 | 0.88 |
| met_before | 0.027 | 0.90 |
| you_perceive_yourself__intelligent | −0.006 | 0.91 |
| partner_found_me__fun | −0.004 | 0.91 |
| partner_race_White | −0.014 | 0.91 |
| partner_race_Latino | −0.015 | 0.94 |
| order_meeting | 0.000 | 0.97 |
| partner_age | 0.001 | 0.97 |
| partner_career_macro_Politics | −0.022 | 0.97 |
| partner_stated_pref_time0_shared_interests | NA | NA |
| you_look_for__shared_interests | NA | NA |
An interesting kind of step that can be added to a recipe is the creation of new predictors representing interactions between the original predictors.
The helper functions come in handy here: they make it much easier to add several interaction terms at once.
Here we use a plain regression instead of classification, to make the coefficients easier to interpret.
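Concretely, an interaction column is just the elementwise product of two predictors. A tiny base R sketch with made-up values:

```r
# Made-up values for two predictors, after dummy creation:
i_found_partner__attractive <- c(8, 5, 7)
sex_Homem <- c(1, 0, 1)

# The extra column a step like
# step_interact(~ i_found_partner__attractive * sex_Homem) would add:
i_found_partner__attractive * sex_Homem  # 8 0 7
```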
dado_treino_regressao <- dado_treino %>%
mutate(
i_liked_partner = if_else(i_liked_partner == "Liked", 1, 0)
)
receita_com_interacao <- recipe(i_liked_partner ~ ., data = dado_treino_regressao) %>%
update_role(
match,
unique_id_number,
id_within_wave,
subject_within_gender,
partnet_id_within_wave,
partner_unique_id_number,
new_role = "ID"
) %>%
step_ordinalscore(
frequency_date
) %>%
step_dummy(
all_nominal(), -frequency_date, -all_outcomes()
) %>%
step_interact(
terms = ~ starts_with("race")*starts_with("partner_race")*importance_same_race
) %>%
step_interact(
terms = ~ starts_with("i_found_partner__")*starts_with("sex_")
) %>%
step_interact(
terms = ~ starts_with("interests_correlation")*starts_with("sex_")
) %>%
step_interact(
terms = ~ starts_with("partner_age")*starts_with("sex_")
) %>%
step_interact(
terms = ~ starts_with("you_perceive_yourself__")*starts_with("sex_")
) %>%
step_interact(
terms = ~ starts_with("partner_perceives_himself__")*starts_with("sex_")
) %>%
step_interact(
terms = ~ starts_with("career_macro")*starts_with("partner_career_macro")
) %>%
step_zv(all_predictors())
With these interaction terms added, we end up with many predictors, several of them quite correlated.
Using highly correlated terms (not only in pairs) can lead to collinearity: the model may attribute an effect to one input variable or another depending on the training sample, and therefore ends up with higher variance.
One way to keep too many variables from being effectively used in the model is to apply a penalty that reduces the number of active coefficients and, as a consequence, the variance of the model. That is what Ridge and Lasso regressions do.
These regressions, which penalize the number and the size of the coefficients linking the input variables to the response, trade bias for lower variance and more interpretability: they have more bias and adapt less to the training set, but they do not vary as much with the particular sample drawn from the population for training. Also, with fewer coefficients switched "on," the model is easier to interpret.
The Elastic Net model combines the Ridge penalty with the Lasso penalty by modifying the loss function of the regression, which in ordinary least squares, the most common case, is, as the name says, the squared error:
\[RSS = \sum_{i = 1}^{n} ( y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij})^2 \]
In Ridge regression, the coefficients are penalized quadratically. This reduces the variance of the model but does not do much to reduce the number of nonzero coefficients:
\[Loss_{Ridge} = RSS + \lambda \sum_{j=1}^{p}\beta_j^2 \]
In Lasso regression, the coefficients are penalized by their absolute value. This reduces the variance of the model AND reduces the number of nonzero coefficients, favoring interpretability:
\[Loss_{Lasso} = RSS + \lambda \sum_{j=1}^{p} \left| \beta_j \right| \]
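The Elastic Net loss combines the two penalties. The formula below is not in the original text; it follows glmnet's parameterization, in which the mixture argument plays the role of \(\alpha\) (\(\alpha = 1\) gives the Lasso, \(\alpha = 0\) the Ridge):

\[Loss_{ElasticNet} = RSS + \lambda \left( \alpha \sum_{j=1}^{p} \left| \beta_j \right| + \frac{1 - \alpha}{2} \sum_{j=1}^{p} \beta_j^2 \right) \]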
The concept of an interface is important in software engineering.
It is always better to depend on interfaces than on implementations. parsnip works as an abstraction layer offering a single interface that takes care of the call to the different implementations.
parsnip's generic interfaces are ready to receive the parameters most commonly used by the engines and to pass the values of those parameters on to the engine.
In the case of linear regressions, some engines such as glmnet accept the hyperparameters needed to implement the Elastic Net (Lasso/Ridge) regression.
lr_mod <-
linear_reg(penalty = .02, mixture = 1) %>%
set_engine("glmnet")
wf_com_interacao <- workflow() %>%
add_recipe(receita_com_interacao) %>%
add_model(lr_mod)
wf_com_interacao
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: linear_reg()
##
## -- Preprocessor ----------------------------------------------------------------
## 10 Recipe Steps
##
## * step_ordinalscore()
## * step_dummy()
## * step_interact()
## * step_interact()
## * step_interact()
## * step_interact()
## * step_interact()
## * step_interact()
## * step_interact()
## * step_zv()
##
## -- Model -----------------------------------------------------------------------
## Linear Regression Model Specification (regression)
##
## Main Arguments:
## penalty = 0.02
## mixture = 1
##
## Computational engine: glmnet
With the Elastic Net estimation, fewer predictors are active:
fit_com_interacao <-
wf_com_interacao %>%
fit(
data = dado_treino_regressao
)
fit_com_interacao %>% tidy() %>%
filter(
estimate != 0
) %>%
arrange(
estimate %>% abs() %>% desc()
) %>%
select(
term,
estimate
) %>%
gt() %>%
fmt_number(
columns = one_of("estimate"),
decimals = 3
)
| term | estimate |
|---|---|
| (Intercept) | −0.245 |
| you_look_for__attractive | −0.145 |
| i_found_partner__attractive | 0.075 |
| you_look_for__shared_interests | 0.065 |
| career_macro_Politics | −0.040 |
| probability_i_find_partner_liked_me | 0.028 |
| i_found_partner__interests | 0.024 |
| i_found_partner__fun | 0.020 |
| goal_Other | −0.016 |
| race_White | −0.014 |
| you_perceive_yourself__attractive | −0.012 |
| partner_found_me__attractive | −0.010 |
| i_found_partner__attractive_x_sex_Homem | 0.010 |
| career_macro_Creative.Arts.Entertainment_x_partner_career_macro_Creative.Arts.Entertainment | −0.004 |
| you_perceive_yourself__sincere | −0.003 |
| probability_partner_find_i_liked_partner | 0.003 |
| importance_same_race | −0.002 |
| you_perceive_yourself__fun | −0.002 |
| race_White_x_importance_same_race | −0.001 |
| importance_same_religion | −0.001 |
| happy_expec | 0.001 |
Here we add a few more pre-processing steps:
step_mutate() creates a new predictor computed from others;
step_poly() adds predictors raised to natural powers greater than 1 (2 by default), making it possible to represent nonlinear effects (quadratic ones, for example);
step_corr() removes predictors until no pair remains with correlation above the threshold.
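Base R's poly() shows the kind of columns step_poly() adds; the input values here are illustrative:

```r
# Illustrative ages:
x <- c(18, 23, 26, 28, 42)

# poly() returns one orthogonal-polynomial column per degree
# (here: a linear and a quadratic term):
p <- poly(x, degree = 2)
ncol(p)  # 2
```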
receita_com_interacao_class <- recipe(i_liked_partner ~ ., data = dado_treino) %>%
update_role(
match,
unique_id_number,
id_within_wave,
subject_within_gender,
partnet_id_within_wave,
partner_unique_id_number,
new_role = "ID"
) %>%
step_ordinalscore(
frequency_date
) %>%
step_dummy(
all_nominal(), -frequency_date, -all_outcomes()
) %>%
step_mutate(
diff_age = my_age - partner_age
) %>%
step_poly(
diff_age
) %>%
step_interact(
terms = ~ starts_with("diff_age") * starts_with("sex")
) %>%
step_interact(
terms = ~ starts_with("race")*starts_with("partner_race")*importance_same_race
) %>%
step_interact(
terms = ~ starts_with("i_found_partner__")*starts_with("sex_")
) %>%
step_interact(
terms = ~ starts_with("interests_correlation")*starts_with("sex_")
) %>%
step_interact(
terms = ~ starts_with("partner_age")*starts_with("sex_")
) %>%
step_interact(
terms = ~ starts_with("you_perceive_yourself__")*starts_with("sex_")
) %>%
step_interact(
terms = ~ starts_with("partner_perceives_himself__")*starts_with("sex_")
) %>%
step_interact(
terms = ~ starts_with("partner_found_me__")*starts_with("sex")
) %>%
step_interact(
terms = ~ starts_with("you_look_for__")*starts_with("sex")
) %>%
step_interact(
terms = ~ starts_with("you_look_for__")*starts_with("i_found_partner")
) %>%
step_interact(
terms = ~ starts_with("probability_partner_find_i_liked_partner")*starts_with("sex")
) %>%
step_interact(
terms = ~ starts_with("career_")*starts_with("sex")
) %>%
step_interact(
terms = ~ starts_with("partner_career_")*starts_with("sex")
) %>%
step_interact(
terms = ~ starts_with("goal")*starts_with("sex")
) %>%
step_corr(
all_predictors(),
threshold = 0.8
) %>%
step_zv(all_predictors())
log_mod <-
logistic_reg() %>%
set_engine("glm")
wf_com_interacao_class <- workflow() %>%
add_recipe(receita_com_interacao_class) %>%
add_model(log_mod)
fit_com_interacao_class <-
wf_com_interacao_class %>%
fit(
data = dado_treino
)
The ROC curve (Receiver Operating Characteristic) was invented around the time of the Second World War to evaluate whether American radar operators were reliably detecting Japanese aircraft from radar signals.
The curve shows, for various thresholds, the true positive rate (or sensitivity) and the false positive rate (fall-out, or \(1 - specificity\)).
A single numeric metric summarizing the overall accuracy of a classification model is the area under this curve (AUC, Area Under the Curve). Note that the closer this area is to one, the lower the false positive rate and the higher the sensitivity.
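In terms of confusion-matrix counts (TP, FP, TN, FN are standard names, not used elsewhere in this text):

\[\text{sensitivity} = \frac{TP}{TP + FN}, \qquad \text{specificity} = \frac{TN}{TN + FP}\]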
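One way to read the AUC: it is the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative one. A minimal base R sketch with hypothetical scores (not taken from the dataset):

```r
# Hypothetical predicted probabilities:
score_pos <- c(0.9, 0.8, 0.4)       # positive cases
score_neg <- c(0.7, 0.3, 0.2, 0.1)  # negative cases

# Compare every positive score against every negative score;
# ties count half:
pairs <- expand.grid(p = score_pos, n = score_neg)
auc <- mean(pairs$p > pairs$n) + 0.5 * mean(pairs$p == pairs$n)
auc  # 11/12: 11 of the 12 pairs are ranked correctly
```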
Here we run the prediction on the test data and use the roc_curve() function from yardstick to generate the data needed to plot the curve.
pred_like <- predict(
object = fit_com_interacao_class,
new_data = dado_teste,
type = "prob"
) %>%
bind_cols(dado_teste %>% select(i_liked_partner))
dados_roc <- pred_like %>%
roc_curve(
truth = i_liked_partner,
.pred_Liked
)
dados_roc %>%
filter(
row_number() %% 100 == 0
) %>%
gt() %>%
fmt_number(
columns = everything(),
decimals = 3
)
| .threshold | specificity | sensitivity |
|---|---|---|
| 0.015 | 0.137 | 0.989 |
| 0.048 | 0.277 | 0.978 |
| 0.109 | 0.409 | 0.958 |
| 0.188 | 0.543 | 0.940 |
| 0.297 | 0.656 | 0.896 |
| 0.446 | 0.751 | 0.831 |
| 0.542 | 0.833 | 0.749 |
| 0.669 | 0.896 | 0.643 |
| 0.787 | 0.936 | 0.510 |
| 0.871 | 0.967 | 0.366 |
| 0.939 | 0.988 | 0.209 |
| 0.991 | 0.999 | 0.040 |
We then plot the curve:
ponto_gatilho <- dados_roc %>%
filter(
.threshold > 0.5
) %>%
slice_min(
n = 1, order_by = .threshold
)
dados_roc %>% ggplot(aes(x = 1 - specificity, y = sensitivity)) +
geom_path() +
geom_point(
data = ponto_gatilho,
aes(x = 1 - specificity, y = sensitivity),
size = 3,
color = "darkblue"
) +
geom_text_repel(
data = ponto_gatilho,
aes(
x = 1 - specificity + 0.15,
y = sensitivity - 0.15,
label = str_glue("{sensitivity %>% percent(accuracy = 0.1)}/{(1 - specificity) %>% percent(accuracy = 0.1)}")
)
) +
geom_abline(lty = 3) +
coord_equal() +
theme_bw()
There is a function that plots the curve automatically.
The roc_auc() function computes the area under the curve:
| .metric | .estimator | .estimate |
|---|---|---|
| roc_auc | binary | 0.8686356 |
receita_sem_interacao_class <- recipe(i_liked_partner ~ ., data = dado_treino) %>%
update_role(
match,
unique_id_number,
id_within_wave,
subject_within_gender,
partnet_id_within_wave,
partner_unique_id_number,
new_role = "ID"
) %>%
step_ordinalscore(
frequency_date
) %>%
step_dummy(
all_nominal(), -frequency_date, -all_outcomes()
)
wf_sem_interacao_class <- workflow() %>%
add_recipe(receita_sem_interacao_class) %>%
add_model(log_mod)
fit_sem_interacao_class <-
wf_sem_interacao_class %>%
fit(
data = dado_treino
)
pred_like_sem <- predict(
object = fit_sem_interacao_class,
new_data = dado_teste,
type = "prob"
) %>%
bind_cols(dado_teste %>% select(i_liked_partner))
pred_like_sem %>%
roc_curve(
truth = i_liked_partner,
.pred_Liked
) %>%
autoplot()
| .metric | .estimator | .estimate |
|---|---|---|
| roc_auc | binary | 0.8669198 |
The curves came out very similar.
Adding the interactions did not help much, but that may be due to the simplicity of the model.
rocs <- bind_rows(
roc_curve(pred_like_sem, i_liked_partner, .pred_Liked) %>% mutate(tipo = "Sem interação"),
roc_curve(pred_like, i_liked_partner, .pred_Liked) %>% mutate(tipo = "Com interação")
)
rocs %>%
ggplot(aes(x = 1 - specificity, y = sensitivity, color = tipo)) +
geom_path() +
geom_abline(lty = 3) +
coord_equal() +
theme_bw() +
theme(
legend.position = "top"
)
The rsample library, which we have already seen, provides the infrastructure for generating the samples needed for cross-validation.
As a reminder, we need to set up the following scheme:
In the code below, we split the data into 5 parts, twice over.
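The call that produced the output below is not shown; judging by the "5-fold cross-validation repeated 2 times using stratification" header, it was presumably rsample::vfold_cv() with v = 5, repeats = 2 and a strata argument. A self-contained sketch using mtcars as a stand-in dataset (without stratification):

```r
library(rsample)

# 5 folds, repeated twice; in the text this would be something like
# vfold_cv(dado_treino, v = 5, repeats = 2, strata = i_liked_partner)
dobras <- vfold_cv(mtcars, v = 5, repeats = 2)
nrow(dobras)  # 10: one row per (repeat, fold) pair
```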
## # 5-fold cross-validation repeated 2 times using stratification
## # A tibble: 10 x 3
## splits id id2
## <list> <chr> <chr>
## 1 <split [2.9K/734]> Repeat1 Fold1
## 2 <split [2.9K/734]> Repeat1 Fold2
## 3 <split [2.9K/732]> Repeat1 Fold3
## 4 <split [2.9K/732]> Repeat1 Fold4
## 5 <split [2.9K/732]> Repeat1 Fold5
## 6 <split [2.9K/734]> Repeat2 Fold1
## 7 <split [2.9K/734]> Repeat2 Fold2
## 8 <split [2.9K/732]> Repeat2 Fold3
## 9 <split [2.9K/732]> Repeat2 Fold4
## 10 <split [2.9K/732]> Repeat2 Fold5
## Rows: 2,930
## Columns: 64
## $ sex <fct> Mulher, Mulher, Mulher, ...
## $ choice <fct> limited, limited, limite...
## $ partner_race <fct> White, White, White, Whi...
## $ frequency_date <ord> Once a month, Once a mon...
## $ career_macro <fct> Lawyer, Lawyer, Lawyer, ...
## $ opinion_duration_of_date <fct> Just Right, Just Right, ...
## $ race <fct> White, White, White, Whi...
## $ goal <fct> Fun, Fun, Fun, Fun, Fun,...
## $ match <dbl> 1, 0, 0, 0, 0, 0, 1, 0, ...
## $ same_race <dbl> 1, 1, 1, 1, 1, 0, 1, 0, ...
## $ partner_liked_me <dbl> 1, 1, 1, 0, 0, 1, 1, 0, ...
## $ i_liked_partner <fct> Liked, Not, Not, Not, Li...
## $ met_before <dbl> 1, 0, 0, 0, 0, 1, 1, 0, ...
## $ unique_id_number <dbl> 4, 4, 4, 4, 4, 5, 5, 5, ...
## $ id_within_wave <dbl> 4, 4, 4, 4, 4, 5, 5, 5, ...
## $ subject_within_gender <dbl> 7, 7, 7, 7, 7, 9, 9, 9, ...
## $ n_people_met_in_wave <dbl> 10, 10, 10, 10, 10, 10, ...
## $ position_meeting <dbl> 6, 6, 6, 6, 6, 4, 4, 4, ...
## $ order_meeting <dbl> 4, 5, 10, 1, 8, 7, 2, 4,...
## $ partnet_id_within_wave <dbl> 4, 6, 7, 8, 10, 3, 4, 5,...
## $ partner_unique_id_number <dbl> 14, 16, 17, 18, 20, 13, ...
## $ interests_correlation <dbl> -0.18, 0.37, 0.35, 0.15,...
## $ my_age <dbl> 23, 23, 23, 23, 23, 21, ...
## $ partner_age <dbl> 23, 25, 30, 27, 24, 22, ...
## $ partner_stated_pref_time0_attractive <dbl> 0.3000000, 0.5000000, 0....
## $ partner_stated_pref_time0_sincere <dbl> 0.0500000, 0.0000000, 0....
## $ partner_stated_pref_time0_intelligent <dbl> 0.1500000, 0.3000000, 0....
## $ partner_stated_pref_time0_fun <dbl> 0.4000000, 0.1000000, 0....
## $ partner_stated_pref_time0_ambitious <dbl> 0.0500000, 0.0000000, 0....
## $ partner_stated_pref_time0_shared_interests <dbl> 0.0500000, 0.1000000, 0....
## $ importance_same_race <dbl> 1, 1, 1, 1, 1, 8, 8, 8, ...
## $ importance_same_religion <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ happy_expec <dbl> 1, 1, 1, 1, 1, 7, 7, 7, ...
## $ i_found_partner__attractive <dbl> 8, 5, 7, 5, 8, 4, 8, 6, ...
## $ i_found_partner__sincere <dbl> 10, 10, 10, 9, 9, 6, 5, ...
## $ i_found_partner__intelligent <dbl> 7, 8, 10, 9, 10, 8, 5, 7...
## $ i_found_partner__fun <dbl> 10, 4, 7, 5, 10, 8, 7, 5...
## $ i_found_partner__ambitious <dbl> 7, 8, 10, 9, 8, 6, 7, 4,...
## $ i_found_partner__interests <dbl> 10, 2, 5, 7, 8, 5, 9, 8,...
## $ partner_found_me__attractive <dbl> 7, 6, 7, 6, 7, 10, 6, 5,...
## $ partner_found_me__sincere <dbl> 7, 6, 6, 7, 8, 10, 8, 9,...
## $ partner_found_me__intelligent <dbl> 7, 7, 3, 8, 8, 10, 6, 9,...
## $ partner_found_me__fun <dbl> 9, 7, 5, 6, 7, 10, 8, 7,...
## $ partner_found_me__ambitious <dbl> 9, 8, 6, 6, 8, 10, 10, 9...
## $ partner_found_me__interests <dbl> 9, 7, 5, 5, 7, 10, 10, 5...
## $ probability_i_find_partner_liked_me <dbl> 10, 3, 1, 6, 8, 5, 2, 4,...
## $ you_look_for__attractive <dbl> 0.20, 0.20, 0.20, 0.20, ...
## $ you_look_for__sincere <dbl> 0.20, 0.20, 0.20, 0.20, ...
## $ you_look_for__intelligent <dbl> 0.20, 0.20, 0.20, 0.20, ...
## $ you_look_for__fun <dbl> 0.20, 0.20, 0.20, 0.20, ...
## $ you_look_for__ambitious <dbl> 0.10, 0.10, 0.10, 0.10, ...
## $ you_look_for__shared_interests <dbl> 0.10, 0.10, 0.10, 0.10, ...
## $ you_perceive_yourself__attractive <dbl> 7, 7, 7, 7, 7, 6, 6, 6, ...
## $ you_perceive_yourself__sincere <dbl> 8, 8, 8, 8, 8, 3, 3, 3, ...
## $ you_perceive_yourself__fun <dbl> 9, 9, 9, 9, 9, 6, 6, 6, ...
## $ you_perceive_yourself__intelligent <dbl> 7, 7, 7, 7, 7, 10, 10, 1...
## $ you_perceive_yourself__ambitious <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ probability_partner_find_i_liked_partner <dbl> 10, 6, 4, 7, 7, 10, 7, 7...
## $ partner_career_macro <fct> Lawyer, Banking/Consulti...
## $ partner_perceives_himself__attractive <dbl> 9, 6, 7, 6, 7, 4, 9, 7, ...
## $ partner_perceives_himself__sincere <dbl> 9, 6, 7, 8, 7, 7, 9, 7, ...
## $ partner_perceives_himself__fun <dbl> 9, 8, 6, 6, 10, 8, 9, 7,...
## $ partner_perceives_himself__intelligent <dbl> 9, 8, 8, 8, 10, 8, 9, 9,...
## $ partner_perceives_himself__ambitious <dbl> 9, 6, 4, 9, 10, 3, 9, 9,...
Running cross-validation, as we saw, means fitting the model several times in runs that are completely independent, which makes it a perfect candidate for parallel execution.
The fit_resamples() function performs all of these fits.
It is possible to register a backend so the runs execute in parallel.
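Conceptually, the reason the folds parallelize so well can be sketched in base R (the names folds_idx and fit_one are illustrative, not part of tidymodels): each resample fit is an independent function call, so swapping a sequential lapply() for a parallel backend, which is what allow_par = TRUE enables, changes nothing about the results.

```r
# Illustrative sketch: each fold's fit is independent of the others,
# so they can run in any order or on any worker.
set.seed(1)
folds_idx <- split(sample(30), rep(1:5, each = 6))  # 5 pretend assessment folds
fit_one <- function(test_rows) length(test_rows)    # stand-in for a model fit
results <- lapply(folds_idx, fit_one)               # sequential here; parallelizable
unlist(results)
```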
all_cores <- parallel::detectCores(logical = FALSE)
library(doParallel)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)
log_mod <-
logistic_reg() %>%
set_engine("glm")
wf_com_interacao_class <- workflow() %>%
add_recipe(receita_com_interacao_class) %>%
add_model(log_mod)
fit_com_interacao_resample <-
wf_com_interacao_class %>%
fit_resamples(
folds,
control = control_resamples(
allow_par = TRUE
)
)
stopCluster(cl)
fit_com_interacao_resample
## # Resampling results
## # 5-fold cross-validation repeated 2 times using stratification
## # A tibble: 10 x 5
## splits id id2 .metrics .notes
## <list> <chr> <chr> <list> <list>
## 1 <split [2.9K/734]> Repeat1 Fold1 <tibble [2 x 4]> <tibble [2 x 1]>
## 2 <split [2.9K/734]> Repeat1 Fold2 <tibble [2 x 4]> <tibble [3 x 1]>
## 3 <split [2.9K/732]> Repeat1 Fold3 <tibble [2 x 4]> <tibble [3 x 1]>
## 4 <split [2.9K/732]> Repeat1 Fold4 <tibble [2 x 4]> <tibble [2 x 1]>
## 5 <split [2.9K/732]> Repeat1 Fold5 <tibble [2 x 4]> <tibble [3 x 1]>
## 6 <split [2.9K/734]> Repeat2 Fold1 <tibble [2 x 4]> <tibble [3 x 1]>
## 7 <split [2.9K/734]> Repeat2 Fold2 <tibble [2 x 4]> <tibble [2 x 1]>
## 8 <split [2.9K/732]> Repeat2 Fold3 <tibble [2 x 4]> <tibble [3 x 1]>
## 9 <split [2.9K/732]> Repeat2 Fold4 <tibble [2 x 4]> <tibble [3 x 1]>
## 10 <split [2.9K/732]> Repeat2 Fold5 <tibble [2 x 4]> <tibble [2 x 1]>
The collect_metrics() function takes the object returned by fit_resamples() and returns a structure with the results of the runs, 10 in our case. Since there are no hyperparameters to vary, there are only two metrics, both referring to the same configuration.
collect_metrics(fit_com_interacao_resample) %>%
gt() %>%
fmt_percent(
columns = vars(mean, std_err)
)
| .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|
| accuracy | binary | 77.13% | 10 | 0.66% | Preprocessor1_Model1 |
| roc_auc | binary | 84.78% | 10 | 0.54% | Preprocessor1_Model1 |
The logistic regression model we ran earlier had no hyperparameters, but when fitting models such as the Elastic Net (lasso-ridge) we want to evaluate which hyperparameter combination performs best on the validation data.
The tune library helps us carry out this procedure.
When defining the model, we assign the result of tune() to each model parameter we want to vary for tuning purposes.
tune_spec_logistic_reg <- logistic_reg(
penalty = tune(),
mixture = tune()
) %>%
set_engine(
engine = "glmnet"
) %>%
set_mode(
"classification"
)
tune_spec_logistic_reg
## Logistic Regression Model Specification (classification)
##
## Main Arguments:
## penalty = tune()
## mixture = tune()
##
## Computational engine: glmnet
One way to run this search for the best hyperparameter combination is to create a grid with several candidate specifications.
The dials library helps build this grid of hyperparameter values.
The grid_regular() function creates a grid with regularly spaced intervals (depending on the parameter, these intervals may be regular on a log scale, for example).
There are functions with the same names as the parameters in the parsnip generic interface. When called without arguments, these functions generate values that usually make sense, but it is also possible to choose the values yourself.
The levels parameter defines how many different values will be used for each parameter.
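As a rough base-R approximation of what this grid looks like (a sketch, not the dials implementation): penalty values regularly spaced on the log10 scale over the range c(-4, -1), crossed with mixture values between 0 and 1, giving 10 × 10 = 100 candidates.

```r
# Sketch of a regular grid: penalty is regular in log10 space,
# mixture is regular in the original space; expand.grid crosses them.
penalty_vals <- 10^seq(-4, -1, length.out = 10)
mixture_vals <- seq(0, 1, length.out = 10)
grid_sketch <- expand.grid(penalty = penalty_vals, mixture = mixture_vals)
nrow(grid_sketch)  # 100 candidate combinations
```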
net_grid <- grid_regular(
penalty(range = c(-4,-1)),
mixture(),
levels = c(
penalty = 10,
mixture = 10
)
)
net_grid
## # A tibble: 100 x 2
## penalty mixture
## <dbl> <dbl>
## 1 0.0001 0
## 2 0.000215 0
## 3 0.000464 0
## 4 0.001 0
## 5 0.00215 0
## 6 0.00464 0
## 7 0.01 0
## 8 0.0215 0
## 9 0.0464 0
## 10 0.1 0
## # ... with 90 more rows
The tune_grid() function runs the search over this grid of parameter values, performing the cross-validation process according to what is passed to the resamples parameter.
all_cores <- parallel::detectCores(logical = FALSE)
library(doParallel)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)
wf_logreg_tune_sample <- workflow() %>%
add_model(
tune_spec_logistic_reg
) %>%
add_recipe(receita_com_interacao_class)
res_logreg_tune_sample_optim <- wf_logreg_tune_sample %>%
tune_grid(
resamples = folds,
grid = net_grid,
control = control_grid(allow_par = TRUE)
)
stopCluster(cl)
res_logreg_tune_sample_optim
## # Tuning results
## # 5-fold cross-validation repeated 2 times using stratification
## # A tibble: 10 x 5
## splits id id2 .metrics .notes
## <list> <chr> <chr> <list> <list>
## 1 <split [2.9K/734]> Repeat1 Fold1 <tibble [200 x 6]> <tibble [1 x 1]>
## 2 <split [2.9K/734]> Repeat1 Fold2 <tibble [200 x 6]> <tibble [1 x 1]>
## 3 <split [2.9K/732]> Repeat1 Fold3 <tibble [200 x 6]> <tibble [1 x 1]>
## 4 <split [2.9K/732]> Repeat1 Fold4 <tibble [200 x 6]> <tibble [1 x 1]>
## 5 <split [2.9K/732]> Repeat1 Fold5 <tibble [200 x 6]> <tibble [1 x 1]>
## 6 <split [2.9K/734]> Repeat2 Fold1 <tibble [200 x 6]> <tibble [1 x 1]>
## 7 <split [2.9K/734]> Repeat2 Fold2 <tibble [200 x 6]> <tibble [1 x 1]>
## 8 <split [2.9K/732]> Repeat2 Fold3 <tibble [200 x 6]> <tibble [1 x 1]>
## 9 <split [2.9K/732]> Repeat2 Fold4 <tibble [200 x 6]> <tibble [1 x 1]>
## 10 <split [2.9K/732]> Repeat2 Fold5 <tibble [200 x 6]> <tibble [1 x 1]>
One way to visualize the results is with geom_tile, after running collect_metrics().
plot_result_tune <- function(results){
results %>%
collect_metrics() %>%
filter(
.metric == "roc_auc"
) %>%
mutate(
ranque = rank(mean)
) %>%
ggplot() +
geom_tile(
aes(
x = penalty,
y = mixture,
fill = ranque
)
) +
geom_text(
aes(
x = penalty,
y = mixture,
label = percent(mean, accuracy = .01),
),
size = 3
) +
scale_x_log10() +
scale_fill_gradient(low = "white", high = "darkgreen") +
theme_minimal() +
theme(
legend.position = "top"
)
}
plot_result_tune(res_logreg_tune_sample_optim)
res_logreg_tune_sample_optim %>%
collect_metrics() %>%
filter(
.metric == "roc_auc"
) %>%
arrange(mean %>% desc()) %>%
gt() %>%
fmt_percent(
columns = vars(mean, std_err)
) %>%
fmt_number(
columns = vars(penalty, mixture),
n_sigfig = 2
)
| penalty | mixture | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|---|
| 0.010 | 0.22 | roc_auc | binary | 85.18% | 10 | 0.51% | Preprocessor1_Model027 |
| 0.010 | 0.33 | roc_auc | binary | 85.17% | 10 | 0.51% | Preprocessor1_Model037 |
| 0.022 | 0.11 | roc_auc | binary | 85.16% | 10 | 0.49% | Preprocessor1_Model018 |
| 0.0046 | 0.67 | roc_auc | binary | 85.14% | 10 | 0.52% | Preprocessor1_Model066 |
| 0.0046 | 0.56 | roc_auc | binary | 85.14% | 10 | 0.52% | Preprocessor1_Model056 |
| 0.0046 | 0.44 | roc_auc | binary | 85.13% | 10 | 0.53% | Preprocessor1_Model046 |
| 0.0022 | 1.0 | roc_auc | binary | 85.13% | 10 | 0.54% | Preprocessor1_Model095 |
| 0.010 | 0.44 | roc_auc | binary | 85.13% | 10 | 0.51% | Preprocessor1_Model047 |
| 0.010 | 0.11 | roc_auc | binary | 85.13% | 10 | 0.52% | Preprocessor1_Model017 |
| 0.0046 | 0.78 | roc_auc | binary | 85.13% | 10 | 0.52% | Preprocessor1_Model076 |
| 0.0022 | 0.89 | roc_auc | binary | 85.12% | 10 | 0.54% | Preprocessor1_Model085 |
| 0.0010 | 1.0 | roc_auc | binary | 85.11% | 10 | 0.55% | Preprocessor1_Model094 |
| 0.0046 | 0.89 | roc_auc | binary | 85.11% | 10 | 0.53% | Preprocessor1_Model086 |
| 0.0022 | 0.78 | roc_auc | binary | 85.11% | 10 | 0.54% | Preprocessor1_Model075 |
| 0.0010 | 0.89 | roc_auc | binary | 85.10% | 10 | 0.55% | Preprocessor1_Model084 |
| 0.0046 | 0.33 | roc_auc | binary | 85.10% | 10 | 0.53% | Preprocessor1_Model036 |
| 0.0022 | 0.67 | roc_auc | binary | 85.09% | 10 | 0.54% | Preprocessor1_Model065 |
| 0.0010 | 0.78 | roc_auc | binary | 85.09% | 10 | 0.54% | Preprocessor1_Model074 |
| 0.022 | 0.22 | roc_auc | binary | 85.09% | 10 | 0.48% | Preprocessor1_Model028 |
| 0.0046 | 1.0 | roc_auc | binary | 85.09% | 10 | 0.53% | Preprocessor1_Model096 |
| 0.0022 | 0.56 | roc_auc | binary | 85.08% | 10 | 0.53% | Preprocessor1_Model055 |
| 0.0010 | 0.67 | roc_auc | binary | 85.07% | 10 | 0.54% | Preprocessor1_Model064 |
| 0.0022 | 0.44 | roc_auc | binary | 85.07% | 10 | 0.53% | Preprocessor1_Model045 |
| 0.00010 | 0 | roc_auc | binary | 85.07% | 10 | 0.50% | Preprocessor1_Model001 |
| 0.00022 | 0 | roc_auc | binary | 85.07% | 10 | 0.50% | Preprocessor1_Model002 |
| 0.00046 | 0 | roc_auc | binary | 85.07% | 10 | 0.50% | Preprocessor1_Model003 |
| 0.0010 | 0 | roc_auc | binary | 85.07% | 10 | 0.50% | Preprocessor1_Model004 |
| 0.0022 | 0 | roc_auc | binary | 85.07% | 10 | 0.50% | Preprocessor1_Model005 |
| 0.0046 | 0 | roc_auc | binary | 85.07% | 10 | 0.50% | Preprocessor1_Model006 |
| 0.010 | 0 | roc_auc | binary | 85.07% | 10 | 0.50% | Preprocessor1_Model007 |
| 0.022 | 0 | roc_auc | binary | 85.07% | 10 | 0.50% | Preprocessor1_Model008 |
| 0.0046 | 0.22 | roc_auc | binary | 85.06% | 10 | 0.52% | Preprocessor1_Model026 |
| 0.010 | 0.56 | roc_auc | binary | 85.06% | 10 | 0.51% | Preprocessor1_Model057 |
| 0.0010 | 0.56 | roc_auc | binary | 85.05% | 10 | 0.54% | Preprocessor1_Model054 |
| 0.0022 | 0.33 | roc_auc | binary | 85.05% | 10 | 0.53% | Preprocessor1_Model035 |
| 0.046 | 0 | roc_auc | binary | 85.04% | 10 | 0.48% | Preprocessor1_Model009 |
| 0.0046 | 0.11 | roc_auc | binary | 85.04% | 10 | 0.52% | Preprocessor1_Model016 |
| 0.0010 | 0.44 | roc_auc | binary | 85.03% | 10 | 0.53% | Preprocessor1_Model044 |
| 0.00046 | 1.0 | roc_auc | binary | 85.03% | 10 | 0.53% | Preprocessor1_Model093 |
| 0.0022 | 0.22 | roc_auc | binary | 85.02% | 10 | 0.53% | Preprocessor1_Model025 |
| 0.00046 | 0.89 | roc_auc | binary | 85.02% | 10 | 0.53% | Preprocessor1_Model083 |
| 0.0010 | 0.33 | roc_auc | binary | 85.01% | 10 | 0.53% | Preprocessor1_Model034 |
| 0.00046 | 0.78 | roc_auc | binary | 85.01% | 10 | 0.53% | Preprocessor1_Model073 |
| 0.00046 | 0.67 | roc_auc | binary | 85.00% | 10 | 0.53% | Preprocessor1_Model063 |
| 0.0010 | 0.22 | roc_auc | binary | 84.99% | 10 | 0.53% | Preprocessor1_Model024 |
| 0.0022 | 0.11 | roc_auc | binary | 84.99% | 10 | 0.52% | Preprocessor1_Model015 |
| 0.00046 | 0.56 | roc_auc | binary | 84.99% | 10 | 0.53% | Preprocessor1_Model053 |
| 0.00046 | 0.44 | roc_auc | binary | 84.98% | 10 | 0.53% | Preprocessor1_Model043 |
| 0.010 | 0.67 | roc_auc | binary | 84.97% | 10 | 0.50% | Preprocessor1_Model067 |
| 0.00046 | 0.33 | roc_auc | binary | 84.96% | 10 | 0.53% | Preprocessor1_Model033 |
| 0.00022 | 1.0 | roc_auc | binary | 84.96% | 10 | 0.54% | Preprocessor1_Model092 |
| 0.0010 | 0.11 | roc_auc | binary | 84.96% | 10 | 0.53% | Preprocessor1_Model014 |
| 0.00022 | 0.89 | roc_auc | binary | 84.96% | 10 | 0.53% | Preprocessor1_Model082 |
| 0.00022 | 0.78 | roc_auc | binary | 84.95% | 10 | 0.54% | Preprocessor1_Model072 |
| 0.046 | 0.11 | roc_auc | binary | 84.95% | 10 | 0.45% | Preprocessor1_Model019 |
| 0.00046 | 0.22 | roc_auc | binary | 84.94% | 10 | 0.53% | Preprocessor1_Model023 |
| 0.00022 | 0.67 | roc_auc | binary | 84.94% | 10 | 0.53% | Preprocessor1_Model062 |
| 0.00010 | 1.0 | roc_auc | binary | 84.93% | 10 | 0.53% | Preprocessor1_Model091 |
| 0.00022 | 0.56 | roc_auc | binary | 84.93% | 10 | 0.53% | Preprocessor1_Model052 |
| 0.00046 | 0.11 | roc_auc | binary | 84.92% | 10 | 0.53% | Preprocessor1_Model013 |
| 0.00022 | 0.44 | roc_auc | binary | 84.92% | 10 | 0.53% | Preprocessor1_Model042 |
| 0.022 | 0.33 | roc_auc | binary | 84.92% | 10 | 0.47% | Preprocessor1_Model038 |
| 0.00010 | 0.89 | roc_auc | binary | 84.91% | 10 | 0.53% | Preprocessor1_Model081 |
| 0.00022 | 0.33 | roc_auc | binary | 84.91% | 10 | 0.53% | Preprocessor1_Model032 |
| 0.00010 | 0.78 | roc_auc | binary | 84.91% | 10 | 0.54% | Preprocessor1_Model071 |
| 0.00010 | 0.67 | roc_auc | binary | 84.90% | 10 | 0.54% | Preprocessor1_Model061 |
| 0.00022 | 0.22 | roc_auc | binary | 84.90% | 10 | 0.53% | Preprocessor1_Model022 |
| 0.10 | 0 | roc_auc | binary | 84.90% | 10 | 0.44% | Preprocessor1_Model010 |
| 0.00010 | 0.56 | roc_auc | binary | 84.90% | 10 | 0.54% | Preprocessor1_Model051 |
| 0.00022 | 0.11 | roc_auc | binary | 84.89% | 10 | 0.53% | Preprocessor1_Model012 |
| 0.00010 | 0.11 | roc_auc | binary | 84.89% | 10 | 0.53% | Preprocessor1_Model011 |
| 0.00010 | 0.44 | roc_auc | binary | 84.88% | 10 | 0.53% | Preprocessor1_Model041 |
| 0.00010 | 0.33 | roc_auc | binary | 84.88% | 10 | 0.53% | Preprocessor1_Model031 |
| 0.00010 | 0.22 | roc_auc | binary | 84.87% | 10 | 0.53% | Preprocessor1_Model021 |
| 0.010 | 0.78 | roc_auc | binary | 84.86% | 10 | 0.50% | Preprocessor1_Model077 |
| 0.010 | 0.89 | roc_auc | binary | 84.75% | 10 | 0.50% | Preprocessor1_Model087 |
| 0.022 | 0.44 | roc_auc | binary | 84.72% | 10 | 0.48% | Preprocessor1_Model048 |
| 0.010 | 1.0 | roc_auc | binary | 84.65% | 10 | 0.51% | Preprocessor1_Model097 |
| 0.046 | 0.22 | roc_auc | binary | 84.57% | 10 | 0.45% | Preprocessor1_Model029 |
| 0.022 | 0.56 | roc_auc | binary | 84.49% | 10 | 0.49% | Preprocessor1_Model058 |
| 0.10 | 0.11 | roc_auc | binary | 84.33% | 10 | 0.44% | Preprocessor1_Model020 |
| 0.022 | 0.67 | roc_auc | binary | 84.29% | 10 | 0.51% | Preprocessor1_Model068 |
| 0.046 | 0.33 | roc_auc | binary | 84.20% | 10 | 0.48% | Preprocessor1_Model039 |
| 0.022 | 0.78 | roc_auc | binary | 84.12% | 10 | 0.52% | Preprocessor1_Model078 |
| 0.022 | 0.89 | roc_auc | binary | 83.89% | 10 | 0.53% | Preprocessor1_Model088 |
| 0.046 | 0.44 | roc_auc | binary | 83.84% | 10 | 0.51% | Preprocessor1_Model049 |
| 0.022 | 1.0 | roc_auc | binary | 83.67% | 10 | 0.53% | Preprocessor1_Model098 |
| 0.10 | 0.22 | roc_auc | binary | 83.62% | 10 | 0.47% | Preprocessor1_Model030 |
| 0.046 | 0.56 | roc_auc | binary | 83.44% | 10 | 0.51% | Preprocessor1_Model059 |
| 0.046 | 0.67 | roc_auc | binary | 83.09% | 10 | 0.51% | Preprocessor1_Model069 |
| 0.10 | 0.33 | roc_auc | binary | 83.03% | 10 | 0.48% | Preprocessor1_Model040 |
| 0.046 | 0.78 | roc_auc | binary | 82.71% | 10 | 0.50% | Preprocessor1_Model079 |
| 0.10 | 0.44 | roc_auc | binary | 82.65% | 10 | 0.49% | Preprocessor1_Model050 |
| 0.046 | 0.89 | roc_auc | binary | 82.36% | 10 | 0.51% | Preprocessor1_Model089 |
| 0.10 | 0.56 | roc_auc | binary | 82.25% | 10 | 0.50% | Preprocessor1_Model060 |
| 0.046 | 1.0 | roc_auc | binary | 82.14% | 10 | 0.52% | Preprocessor1_Model099 |
| 0.10 | 0.67 | roc_auc | binary | 81.80% | 10 | 0.52% | Preprocessor1_Model070 |
| 0.10 | 0.78 | roc_auc | binary | 81.55% | 10 | 0.54% | Preprocessor1_Model080 |
| 0.10 | 0.89 | roc_auc | binary | 81.24% | 10 | 0.56% | Preprocessor1_Model090 |
| 0.10 | 1.0 | roc_auc | binary | 80.84% | 10 | 0.58% | Preprocessor1_Model100 |
A decision tree partitions the space formed by the explanatory variables into subspaces, based on the "purity" of those subspaces with respect to the dependent variable.
Below we run an experiment with just 2 continuous features and one categorical feature with two possible values.
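Before fitting the model, it helps to see what "purity" means concretely. A common impurity measure used by tree engines such as rpart is the Gini index, 1 - sum(p_k^2): a node containing a single class scores 0, and a 50/50 node scores 0.5, the maximum for two classes. A minimal sketch in base R (the gini function here is our own, for illustration):

```r
# Gini impurity of a set of class labels: 0 = perfectly pure.
gini <- function(classes) {
  p <- table(classes) / length(classes)  # class proportions
  1 - sum(p^2)
}
gini(c("Liked", "Liked", "Liked"))       # 0: pure node
gini(c("Liked", "Not", "Liked", "Not"))  # 0.5: maximally impure for 2 classes
```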
receita_arvore_decisao_demo <- recipe(
i_liked_partner ~
i_found_partner__attractive +
sex +
my_age,
data = dado_treino
)
arvore_decisao_mod <-
decision_tree(
tree_depth = 4,
min_n = 1,
cost_complexity = 0
) %>%
set_engine("rpart") %>%
set_mode("classification")
wf_arvore_decisao_demo <- workflow() %>%
add_recipe(receita_arvore_decisao_demo) %>%
add_model(arvore_decisao_mod)
fit_ad_demo <-
wf_arvore_decisao_demo %>%
fit(
data = dado_treino
)
The algorithm creates a decision tree like this one.
It therefore chooses the order and the attribute values that will slice the population. Following the tree down to its leaves, the nodes without children, we can determine the predicted output for any value of the vector x.
Since our space is formed by two continuous features and one categorical feature with two categories, we can think of it as two planes.
The decision tree algorithm chooses which feature splits each plane into two parts that are purer in the classification sense (with more points sharing the same category).
Because we have two continuous features plus one with two possible categories, this is equivalent to dividing the space, represented by two planes, with vertical and horizontal lines.
valores_partner_atractive = tibble(i_found_partner__attractive = seq(0, 10, by = 0.1))
valores_my_age =
tibble(
my_age = seq(min(dado_treino$my_age), max(dado_treino$my_age), by = 1 )
)
valores_sex = tibble( sex = c("Homem", "Mulher") )
dados_novos <- crossing(
valores_partner_atractive,
valores_my_age,
valores_sex
)
predicoes_arvore <- predict(
object = fit_ad_demo,
new_data = dados_novos
)
predicoes_arvore_com_dados <- bind_cols(
dados_novos,
predicoes_arvore
)
ggplot(predicoes_arvore_com_dados) +
geom_tile(
aes(
x = i_found_partner__attractive,
y = my_age,
fill = .pred_class,
alpha = 0.1
)
) +
geom_jitter(
data = dado_treino,
width = 0.5,
aes(
x = i_found_partner__attractive,
y = my_age,
color = i_liked_partner
),
size = 0.8,
show.legend = FALSE
) +
geom_vline(
xintercept = 6.75
) +
geom_vline(
xintercept = 7.25
) +
geom_hline(
data = tibble(
my_age = 20.5,
sex = "Homem"
),
aes(
yintercept = my_age
)
) +
geom_hline(
data = tibble(
my_age = 28.5,
sex = "Mulher"
),
aes(
yintercept = my_age
)
) +
facet_wrap(
~sex
) +
guides(
alpha = FALSE
) +
theme_minimal() +
scale_fill_manual(
values = c(Liked = "darkgreen", Not = "darkred")
) +
scale_color_manual(
values = c(Liked = "darkgreen", Not = "darkred")
) +
theme(
legend.position = "top"
) +
labs(
fill = "",
x = "Achei bonito",
y = "Minha idade"
)
log_reg_mod <-
logistic_reg(
) %>%
set_engine("glm") %>%
set_mode("classification")
wf_log_reg_demo <- workflow() %>%
add_recipe(receita_arvore_decisao_demo) %>%
add_model(log_reg_mod)
fit_log_reg_demo <-
wf_log_reg_demo %>%
fit(
data = dado_treino
)
For comparison, we can look at the same picture using logistic regression.
predicoes_log_reg <- predict(
object = fit_log_reg_demo,
new_data = dados_novos
)
predicoes_log_reg_com_dados <- bind_cols(
dados_novos,
predicoes_log_reg
)
ggplot(predicoes_log_reg_com_dados) +
geom_tile(
aes(
x = i_found_partner__attractive,
y = my_age,
fill = .pred_class,
alpha = 0.1
)
) +
geom_jitter(
data = dado_treino,
width = 0.5,
aes(
x = i_found_partner__attractive,
y = my_age,
color = i_liked_partner
),
size = 0.8,
show.legend = FALSE
) +
facet_wrap(
~sex
) +
guides(
alpha = FALSE
) +
theme_minimal() +
scale_fill_manual(
values = c(Liked = "darkgreen", Not = "darkred")
) +
scale_color_manual(
values = c(Liked = "darkgreen", Not = "darkred")
) +
theme(
legend.position = "top"
) +
labs(
fill = "",
x = "Achei bonito",
y = "Minha idade"
)
The way a decision tree is built gives it high variance.
Each partitioning decision is made from characteristics that may be very specific to the training data.
One idea used in Random Forests is to create \(trees\) training sets from the original set by drawing samples of the same size as the original, WITH REPLACEMENT. In addition, each time a tree is split, the split may only consider \(mtry\) of the explanatory variables.
The final result is an average of the decisions of those \(trees\) trees.
These two changes give the model much lower variance than a single decision tree.
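The two sources of randomness just described can be sketched in a few lines of base R (the variable names here are illustrative, not from ranger): a bootstrap sample of rows for each tree, and a random subset of \(mtry\) predictors considered at each split.

```r
# Sketch: bootstrap rows for one tree, and mtry candidate predictors for one split.
set.seed(123)
n <- 10
predictors <- c("my_age", "i_found_partner__attractive", "sex")
boot_rows  <- sample(n, size = n, replace = TRUE)  # rows this tree trains on
split_vars <- sample(predictors, size = 2)         # mtry = 2 candidates at a split
length(boot_rows)          # always n...
length(unique(boot_rows))  # ...but with repeats, so some rows are left out
```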
Below we can see that the model offers far more flexibility than simple decision trees.
predicoes_ranger <- predict(
object = fit_ranger_demo,
new_data = dados_novos
)
predicoes_ranger_com_dados <- bind_cols(
dados_novos,
predicoes_ranger
)
ggplot(predicoes_ranger_com_dados) +
geom_tile(
aes(
x = i_found_partner__attractive,
y = my_age,
fill = .pred_class,
alpha = 0.1
)
) +
geom_jitter(
data = dado_treino,
width = 0.5,
aes(
x = i_found_partner__attractive,
y = my_age,
color = i_liked_partner
),
size = 0.8,
show.legend = FALSE
) +
facet_wrap(
~sex
) +
guides(
alpha = FALSE
) +
theme_minimal() +
scale_fill_manual(
values = c(Liked = "darkgreen", Not = "darkred")
) +
scale_color_manual(
values = c(Liked = "darkgreen", Not = "darkred")
) +
theme(
legend.position = "top"
) +
labs(
fill = "",
x = "Achei bonito",
y = "Minha idade"
)
Below we build the new model configuration, with the hyperparameters to be tuned.
tune_spec_rand_forest <- rand_forest(
mtry = tune(),
trees = tune(),
min_n = tune()
) %>%
set_engine(
engine = "ranger"
) %>%
set_mode(
"classification"
)
tune_spec_rand_forest
## Random Forest Model Specification (classification)
##
## Main Arguments:
## mtry = tune()
## trees = tune()
## min_n = tune()
##
## Computational engine: ranger
Let us create a first grid to try to optimize the hyperparameter values.
rand_for_grid <- grid_regular(
mtry(range = c(3,50)),
trees(range = c(1,100)),
min_n(),
levels = 4
)
rand_for_grid
## # A tibble: 64 x 3
## mtry trees min_n
## <int> <int> <int>
## 1 3 1 2
## 2 18 1 2
## 3 34 1 2
## 4 50 1 2
## 5 3 34 2
## 6 18 34 2
## 7 34 34 2
## 8 50 34 2
## 9 3 67 2
## 10 18 67 2
## # ... with 54 more rows
all_cores <- parallel::detectCores(logical = FALSE)
library(doParallel)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)
wf_rand_for_tune_sample <- workflow() %>%
add_model(
tune_spec_rand_forest
) %>%
add_recipe(receita_com_interacao_class)
res_rand_for <- wf_rand_for_tune_sample %>%
tune_grid(
resamples = folds,
grid = rand_for_grid,
control = control_grid(allow_par = TRUE, verbose = TRUE)
)
stopCluster(cl)
res_rand_for <- read_rds("resultados/res_rand_for.rds")
collect_metrics(res_rand_for) %>% arrange(desc(mean)) %>%
filter(
.metric == "roc_auc"
) %>%
head(
n = 10
) %>%
gt() %>%
fmt_percent(
columns = vars(mean, std_err)
)
| mtry | trees | min_n | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|---|---|
| 18 | 100 | 2 | roc_auc | binary | 88.03% | 10 | 0.48% | Preprocessor1_Model14 |
| 50 | 100 | 2 | roc_auc | binary | 87.72% | 10 | 0.49% | Preprocessor1_Model16 |
| 34 | 100 | 2 | roc_auc | binary | 87.67% | 10 | 0.46% | Preprocessor1_Model15 |
| 18 | 67 | 2 | roc_auc | binary | 87.67% | 10 | 0.47% | Preprocessor1_Model10 |
| 18 | 100 | 14 | roc_auc | binary | 87.60% | 10 | 0.50% | Preprocessor1_Model30 |
| 50 | 67 | 2 | roc_auc | binary | 87.44% | 10 | 0.49% | Preprocessor1_Model12 |
| 34 | 100 | 14 | roc_auc | binary | 87.43% | 10 | 0.50% | Preprocessor1_Model31 |
| 34 | 67 | 2 | roc_auc | binary | 87.43% | 10 | 0.44% | Preprocessor1_Model11 |
| 18 | 67 | 14 | roc_auc | binary | 87.41% | 10 | 0.47% | Preprocessor1_Model26 |
| 50 | 100 | 14 | roc_auc | binary | 87.34% | 10 | 0.43% | Preprocessor1_Model32 |
plot_result_tune_ranger <- function(results){
results %>%
map_df(
.f = collect_metrics
) %>%
filter(
.metric == "roc_auc"
) %>%
mutate(
ranque = rank(mean)
) %>%
ggplot() +
geom_tile(
aes(
x = mtry %>% factor(),
y = trees %>% factor(),
fill = ranque
)
) +
geom_text(
aes(
x = mtry %>% factor(),
y = trees %>% factor(),
label = percent(mean, accuracy = .01),
),
color = "black",
size = 3,
fontface = "bold"
) +
scale_fill_gradient(low = "white", high = "darkgreen") +
facet_wrap(
~min_n,
ncol = 1,
labeller = as_labeller( function(x){str_glue("min_n: {x}")} )
) +
theme_minimal() +
theme(
legend.position = "top"
) +
labs(
x = "mtry",
y = "trees"
)
}
plot_result_tune_ranger(list(res_rand_for))
rand_for_grid_optim <- grid_regular(
mtry(range = c(4,45)),
trees(range = c(150,350)),
min_n(c(1,6)),
levels = 4
)
all_cores <- parallel::detectCores(logical = FALSE)
library(doParallel)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)
res_rand_for_optim <- wf_rand_for_tune_sample %>%
tune_grid(
resamples = folds,
grid = rand_for_grid_optim,
control = control_grid(allow_par = TRUE, verbose = TRUE)
)
stopCluster(cl)
res_rand_for_optim <- read_rds("resultados/res_rand_for_optim.rds")
collect_metrics(res_rand_for_optim) %>% arrange(desc(mean)) %>%
filter(
.metric == "roc_auc"
) %>%
head(
n = 10
) %>%
gt() %>%
fmt_percent(
columns = vars(mean, std_err)
)
| mtry | trees | min_n | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|---|---|
| 17 | 283 | 1 | roc_auc | binary | 88.29% | 10 | 0.46% | Preprocessor1_Model10 |
| 17 | 350 | 2 | roc_auc | binary | 88.19% | 10 | 0.45% | Preprocessor1_Model30 |
| 17 | 350 | 1 | roc_auc | binary | 88.18% | 10 | 0.44% | Preprocessor1_Model14 |
| 17 | 216 | 2 | roc_auc | binary | 88.18% | 10 | 0.45% | Preprocessor1_Model22 |
| 17 | 216 | 1 | roc_auc | binary | 88.13% | 10 | 0.47% | Preprocessor1_Model06 |
| 17 | 350 | 4 | roc_auc | binary | 88.12% | 10 | 0.46% | Preprocessor1_Model46 |
| 17 | 283 | 2 | roc_auc | binary | 88.12% | 10 | 0.44% | Preprocessor1_Model26 |
| 17 | 350 | 6 | roc_auc | binary | 88.10% | 10 | 0.43% | Preprocessor1_Model62 |
| 17 | 283 | 4 | roc_auc | binary | 88.09% | 10 | 0.46% | Preprocessor1_Model42 |
| 31 | 283 | 4 | roc_auc | binary | 88.07% | 10 | 0.45% | Preprocessor1_Model43 |
rand_for_grid_optim_2 <- grid_regular(
mtry(range = c(10,35)),
trees(range = c(350,500)),
min_n(range = c(1, 3)),
levels = c(mtry = 3, trees = 4, min_n = 3)
)
all_cores <- parallel::detectCores(logical = FALSE)
library(doParallel)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)
res_rand_for_optim_2 <- wf_rand_for_tune_sample %>%
tune_grid(
resamples = folds,
grid = rand_for_grid_optim_2,
control = control_grid(allow_par = TRUE, verbose = TRUE)
)
stopCluster(cl)
res_rand_for_optim_2 <- read_rds("resultados/res_rand_for_optim_2.rds")
collect_metrics(res_rand_for_optim_2) %>% arrange(desc(mean)) %>%
filter(
.metric == "roc_auc"
) %>%
head(
n = 10
) %>%
gt() %>%
fmt_percent(
columns = vars(mean, std_err)
)
| mtry | trees | min_n | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|---|---|
| 10 | 500 | 1 | roc_auc | binary | 88.31% | 10 | 0.42% | Preprocessor1_Model10 |
| 22 | 500 | 1 | roc_auc | binary | 88.30% | 10 | 0.42% | Preprocessor1_Model11 |
| 10 | 500 | 2 | roc_auc | binary | 88.26% | 10 | 0.42% | Preprocessor1_Model22 |
| 22 | 350 | 1 | roc_auc | binary | 88.24% | 10 | 0.44% | Preprocessor1_Model02 |
| 10 | 450 | 1 | roc_auc | binary | 88.24% | 10 | 0.43% | Preprocessor1_Model07 |
| 22 | 450 | 1 | roc_auc | binary | 88.24% | 10 | 0.47% | Preprocessor1_Model08 |
| 22 | 400 | 1 | roc_auc | binary | 88.24% | 10 | 0.48% | Preprocessor1_Model05 |
| 10 | 400 | 2 | roc_auc | binary | 88.24% | 10 | 0.43% | Preprocessor1_Model16 |
| 22 | 500 | 3 | roc_auc | binary | 88.22% | 10 | 0.45% | Preprocessor1_Model35 |
| 22 | 500 | 2 | roc_auc | binary | 88.21% | 10 | 0.47% | Preprocessor1_Model23 |
rand_for_grid_optim_3 <- grid_regular(
mtry(range = c(8, 22)),
trees(range = c(400,500)),
min_n(range = c(1,1)),
levels = c(mtry = 5, trees = 5, min_n = 1)
)
rand_for_grid_optim_3
## # A tibble: 25 x 3
## mtry trees min_n
## <int> <int> <int>
## 1 8 400 1
## 2 11 400 1
## 3 15 400 1
## 4 18 400 1
## 5 22 400 1
## 6 8 425 1
## 7 11 425 1
## 8 15 425 1
## 9 18 425 1
## 10 22 425 1
## # ... with 15 more rows
res_rand_for_optim_3 <- read_rds("resultados/res_rand_for_optim_3.rds" )
collect_metrics(res_rand_for_optim_3) %>% arrange(desc(mean)) %>%
filter(
.metric == "roc_auc"
) %>%
head(
n = 10
) %>%
gt() %>%
fmt_percent(
columns = vars(mean, std_err)
)
| mtry | trees | min_n | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|---|---|
| 22 | 475 | 1 | roc_auc | binary | 88.25% | 10 | 0.42% | Preprocessor1_Model20 |
| 18 | 500 | 1 | roc_auc | binary | 88.25% | 10 | 0.46% | Preprocessor1_Model24 |
| 15 | 450 | 1 | roc_auc | binary | 88.24% | 10 | 0.45% | Preprocessor1_Model13 |
| 15 | 500 | 1 | roc_auc | binary | 88.24% | 10 | 0.45% | Preprocessor1_Model23 |
| 18 | 475 | 1 | roc_auc | binary | 88.23% | 10 | 0.44% | Preprocessor1_Model19 |
| 11 | 450 | 1 | roc_auc | binary | 88.22% | 10 | 0.42% | Preprocessor1_Model12 |
| 15 | 425 | 1 | roc_auc | binary | 88.22% | 10 | 0.44% | Preprocessor1_Model08 |
| 11 | 500 | 1 | roc_auc | binary | 88.22% | 10 | 0.44% | Preprocessor1_Model22 |
| 18 | 425 | 1 | roc_auc | binary | 88.22% | 10 | 0.46% | Preprocessor1_Model09 |
| 18 | 400 | 1 | roc_auc | binary | 88.22% | 10 | 0.45% | Preprocessor1_Model04 |
plot_result_tune_ranger(
list(
res_rand_for,
res_rand_for_optim,
res_rand_for_optim_2,
res_rand_for_optim_3
)
)
Despite the hype surrounding neural networks, which makes them seem magical and mysterious, they are nonlinear methods in which linear regressions are applied on top of the outputs of other regressions. Some of those outputs are transformed by nonlinear activation functions, which gives the method its nonlinear character. The coefficients of these regressions, also called weights, are calibrated to minimize the error through a clever algorithm called backpropagation.
There are many neural network architectures.
We will use the simplest and most widely used one: the single hidden layer.
As the figure on the left shows, this model has an input layer, an intermediate "hidden" layer, and an output layer.
\(p\) is the number of inputs, \(K\) is the number of possible output categories (in the classification case; in the regression case there is a single output), and \(M\) is the number of neurons in the hidden layer.
The name neuron comes from a simplification of how the cell works. Like the nodes of a neural network, a cell also receives several inputs and has an activation process (it fires or not) depending on the intensity of the information it receives at its inputs.
However, the human brain is an extremely complex network with an architecture selected over billions of years.
There are more complicated neural network architectures, but the number of weights to calibrate explodes; they require massive amounts of data. The field that studies these deep networks is called deep learning.
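The "regressions on top of regressions" idea can be written out as a forward pass in a few lines of base R. This is a hedged sketch with random, uncalibrated weights (the names W1, w2, etc. are our own), using the shapes from the text: \(p\) inputs, \(M\) hidden neurons, and one output for the binary case.

```r
# Forward pass of a single-hidden-layer network with random weights.
set.seed(1)
p <- 3; M <- 30
x  <- rnorm(p)                      # one input vector
W1 <- matrix(rnorm(M * p), M, p)    # input -> hidden weights
b1 <- rnorm(M)                      # hidden-layer intercepts
w2 <- rnorm(M)                      # hidden -> output weights
b2 <- rnorm(1)
sigmoid <- function(z) 1 / (1 + exp(-z))
h     <- sigmoid(W1 %*% x + b1)     # nonlinear activation: the "hidden" layer
y_hat <- sigmoid(sum(w2 * h) + b2)  # output: a probability between 0 and 1
```

Training would adjust W1, b1, w2, b2 via backpropagation; here we only show how an input flows to an output.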
Vamos fazer o mesmo exercício que já fizemos com os outros modelos, treinando ela para apenas dois features contínuos e um categórico.
set.seed(555)
receita_nnet_demo <- recipe(
i_liked_partner ~
i_found_partner__attractive +
sex +
my_age,
data = dado_treino
)
nnet_mod <-
mlp(
hidden_units = 30,
epochs = 1000
) %>%
set_engine("nnet") %>%
set_mode("classification")
wf_nnet_demo <- workflow() %>%
add_recipe(receita_nnet_demo) %>%
add_model(nnet_mod)
fit_nnet_demo <-
wf_nnet_demo %>%
fit(
data = dado_treino
)
fit_nnet_demo
## == Workflow [trained] ==========================================================
## Preprocessor: Recipe
## Model: mlp()
##
## -- Preprocessor ----------------------------------------------------------------
## 0 Recipe Steps
##
## -- Model -----------------------------------------------------------------------
## a 3-30-1 network with 151 weights
## inputs: i_found_partner__attractive sexHomem my_age
## output(s): ..y
## options were - entropy fitting
We can see that the model produces a complex classification pattern, one that manages to capture the interaction between the inputs.
predicoes_nnet <- predict(
object = fit_nnet_demo,
new_data = dados_novos
)
predicoes_nnet_com_dados <- bind_cols(
dados_novos,
predicoes_nnet
)
ggplot(predicoes_nnet_com_dados) +
geom_tile(
aes(
x = i_found_partner__attractive,
y = my_age,
fill = .pred_class,
alpha = 0.1
)
) +
geom_jitter(
data = dado_treino,
width = 0.5,
aes(
x = i_found_partner__attractive,
y = my_age,
color = i_liked_partner
),
size = 0.8,
show.legend = FALSE
) +
facet_wrap(
~sex
) +
guides(
alpha = FALSE
) +
theme_minimal() +
scale_fill_manual(
values = c(Liked = "darkgreen", Not = "darkred")
) +
scale_color_manual(
values = c(Liked = "darkgreen", Not = "darkred")
) +
theme(
legend.position = "top"
) +
labs(
fill = "",
x = "Achei bonito",
y = "Minha idade"
)
hidden_units is the number of neurons in the hidden layer; the larger it is, the more flexible the network.
penalty is a regularization penalty similar to the one in ridge regression; it also limits the weights that remain effectively active.
epochs is the number of iterations used to train the network; the more iterations, the more closely the network adapts to the training data.
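The objects `tune_spec_nnet` and `nnet_grid` used below are defined outside this excerpt. A sketch consistent with the 27-row grid shown next (3 levels per hyperparameter; the exact ranges are an assumption inferred from the printed values) would be:

```r
library(tidymodels)

# Model spec with the three hyperparameters marked for tuning
tune_spec_nnet <- mlp(
  hidden_units = tune(),
  penalty = tune(),
  epochs = tune()
) %>%
  set_engine("nnet") %>%
  set_mode("classification")

# Regular grid: 3 x 3 x 3 = 27 candidates. penalty() works on a log10
# scale, so range = c(-10, 0) yields 1e-10, 1e-5 and 1, matching the
# values printed below.
nnet_grid <- grid_regular(
  hidden_units(range = c(1, 10)),
  penalty(range = c(-10, 0)),
  epochs(range = c(10, 1000)),
  levels = 3
)
```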
## # A tibble: 27 x 3
## hidden_units penalty epochs
## <int> <dbl> <int>
## 1 1 0.0000000001 10
## 2 5 0.0000000001 10
## 3 10 0.0000000001 10
## 4 1 0.00001 10
## 5 5 0.00001 10
## 6 10 0.00001 10
## 7 1 1 10
## 8 5 1 10
## 9 10 1 10
## 10 1 0.0000000001 505
## # ... with 17 more rows
wf_nnet_tune_sample <- workflow() %>%
add_model(
tune_spec_nnet
) %>%
add_recipe(receita_com_interacao_class)
all_cores <- parallel::detectCores(logical = FALSE)
library(doParallel)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)
res_nnet <- wf_nnet_tune_sample %>%
tune_grid(
resamples = folds,
grid = nnet_grid,
control = control_grid(allow_par = TRUE, verbose = TRUE)
)
stopCluster(cl)
res_nnet
## # A tibble: 36 x 9
## hidden_units penalty epochs .metric .estimator mean n std_err .config
## <int> <dbl> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 1.00e+ 0 505 roc_auc binary 0.848 10 0.00393 Preproc~
## 2 1 1.00e+ 0 1000 roc_auc binary 0.848 10 0.00395 Preproc~
## 3 5 1.00e+ 0 1000 roc_auc binary 0.844 10 0.00419 Preproc~
## 4 5 1.00e+ 0 505 roc_auc binary 0.837 10 0.00297 Preproc~
## 5 5 1.00e- 5 1000 roc_auc binary 0.793 10 0.00817 Preproc~
## 6 5 1.00e-10 505 roc_auc binary 0.790 10 0.00923 Preproc~
## 7 5 1.00e-10 1000 roc_auc binary 0.787 10 0.00875 Preproc~
## 8 5 1.00e- 5 505 roc_auc binary 0.784 10 0.0116 Preproc~
## 9 1 1.00e+ 0 1000 accura~ binary 0.769 10 0.00487 Preproc~
## 10 1 1.00e+ 0 505 accura~ binary 0.768 10 0.00459 Preproc~
## # ... with 26 more rows
plot_result_tune_nnet <- function(results){
results %>%
map_df(
.f = collect_metrics
) %>%
filter(
.metric == "roc_auc"
) %>%
group_by(
penalty, epochs, hidden_units
) %>%
summarise(
mean = mean(mean)
) %>%
ungroup() %>%
mutate(
ranque = rank(mean)
) %>%
ggplot() +
geom_tile(
aes(
x = penalty,
y = epochs %>% factor(),
fill = ranque
)
) +
scale_x_log10() +
geom_text(
aes(
x = penalty,
y = epochs %>% factor(),
label = percent(mean, accuracy = .01)
),
color = "black",
size = 3,
fontface = "bold"
) +
scale_fill_gradient(low = "white", high = "darkgreen") +
facet_wrap(
~hidden_units,
ncol = 1,
labeller = as_labeller( function(x){str_glue("hidden: {x}")} )
) +
theme_minimal() +
theme(
legend.position = "top"
)
}
plot_result_tune_nnet(list(res_nnet))
nnet_grid_2 <- grid_regular(
hidden_units(range = c(1,5)),
penalty( range = c(-1,1)),
epochs(range = c(500, 2000)),
levels = c(hidden_units = 2, penalty = 3, epochs = 4)
)
nnet_grid_2
## # A tibble: 24 x 3
## hidden_units penalty epochs
## <int> <dbl> <int>
## 1 1 0.1 500
## 2 5 0.1 500
## 3 1 1 500
## 4 5 1 500
## 5 1 10 500
## 6 5 10 500
## 7 1 0.1 1000
## 8 5 0.1 1000
## 9 1 1 1000
## 10 5 1 1000
## # ... with 14 more rows
all_cores <- parallel::detectCores(logical = FALSE)
library(doParallel)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)
res_nnet_2 <- wf_nnet_tune_sample %>%
tune_grid(
resamples = folds,
grid = nnet_grid_2,
control = control_grid(allow_par = TRUE, verbose = TRUE)
)
stopCluster(cl)
res_nnet_2 <- read_rds("resultados/res_nnet_2.rds")
collect_metrics(res_nnet_2) %>% arrange(desc(mean))
## # A tibble: 48 x 9
## hidden_units penalty epochs .metric .estimator mean n std_err .config
## <int> <dbl> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 5 10 500 roc_auc binary 0.853 10 0.00335 Preproces~
## 2 5 10 1000 roc_auc binary 0.853 10 0.00354 Preproces~
## 3 5 10 1500 roc_auc binary 0.853 10 0.00319 Preproces~
## 4 5 10 2000 roc_auc binary 0.853 10 0.00296 Preproces~
## 5 1 1 1500 roc_auc binary 0.848 10 0.00395 Preproces~
## 6 1 1 2000 roc_auc binary 0.848 10 0.00389 Preproces~
## 7 1 1 500 roc_auc binary 0.848 10 0.00388 Preproces~
## 8 1 1 1000 roc_auc binary 0.848 10 0.00393 Preproces~
## 9 1 0.1 1500 roc_auc binary 0.848 10 0.00411 Preproces~
## 10 1 0.1 1000 roc_auc binary 0.848 10 0.00410 Preproces~
## # ... with 38 more rows
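A natural next step after tuning, not shown in this excerpt, is to pick the best hyperparameter combination and finalize the workflow. A sketch using the standard tune/workflows API (`dado_treino` is the training set used earlier):

```r
# Best hyperparameters by cross-validated ROC AUC
best_nnet <- select_best(res_nnet_2, metric = "roc_auc")

# Plug them into the tuning workflow and refit on the full training set
final_wf_nnet <- wf_nnet_tune_sample %>%
  finalize_workflow(best_nnet)

final_fit_nnet <- final_wf_nnet %>%
  fit(data = dado_treino)
```

The finalized fit can then be evaluated on the held-out test set, the same way it was done for the other models.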